Abstract


This document proposes an interface to parallel file systems intended for use with a variety of parallel computers. This proposal is based on the separation of programmer convenience functions from high-performance enabling functions. We propose that the former be supported above this interface, possibly in client libraries. The latter, functions that enable high performance, are defined by this proposed API under the assumption that these functions are more likely to need system and vendor-specific support.

Specifically, this proposal includes functions which support reading and writing with scatter-gather addressing for memory and file ranges, and asynchronous operations. It also includes mechanisms that permit client control over client caching, and file access and layout hints. Finally, it includes a mechanism by which this API can be extended, along with extensions for fast file copy and for batching collective I/O operations.
Context

This proposal is being developed by the Scalable I/O Initiative (SIO), a consortium of universities, national laboratories, and industries studying parallel and scalable I/O systems for large parallel computers. This proposal is not a commitment on the part of any member of SIO to support these interfaces. However, it is intended that within the SIO effort several implementations of parallel file systems compliant with this interface be produced on several different platforms. We do not expect these interfaces to be finalized until implementation and user experience are obtained. SIO will foster such implementation and application development experience. The ultimate goal of this effort to produce a common parallel file system interface is two-fold: to support research in the area of parallel I/O, and to eventually recommend additions of parallel I/O interfaces to the X/Open and POSIX standards.

This document contains a basic API plus several extensions. Sections 3 through 14 in this document contain the basic API, which all conforming implementations must implement. Sections 15 and 16 contain extensions to the API which may optionally be provided by implementations.

Within the SIO research community, proposals (and counterproposals) for future modifications to this API are journalled in a separate document called “Proposal for a Common Parallel File System Programming Interface; Part II: What’s in Progress.”

Perhaps unavoidably, this document is more about the description of interfaces than it is about their rationale. We apologize in advance for your many unanswered questions.
1 Introduction

The intent of the interfaces presented here is to add to the standard X/Open XPG 4.2 interfaces, which were earlier defined in IEEE Standard 1003.1 (POSIX). It is widely recognized by vendors of distributed memory parallel computers and workstation clusters, such as IBM and Intel, that extensions to the X/Open XPG 4.2 and POSIX interfaces to support high performance file I/O for parallel applications are desirable. However, there is little agreement about what these extensions should be. This results in part from vendor extensions that exclusively emphasize the capabilities of a specific machine or application class. As a result, it is not currently possible for programmers to write application programs using extended file system interfaces that are portable from one parallel computer to another.

Clearly, there is a need for a new set of standard interfaces, preferably a set of extensions to the X/Open XPG 4.2 interfaces, if we wish users and third party software vendors to use the extended features of parallel file systems. The SIO community has chosen to divide the file system interface into two levels: a low-level interface which hides machine-dependent details and contains only those features needed to provide good performance, and a high-level interface which provides features for programmer convenience and to support particular application classes.^1 This document describes only the low-level interface.

There are portions of this API which provide functionality that is redundant with the functionality provided in the X/Open interfaces. This is to enable some SIO members to develop complete experimental file systems with just this API, without the added burden of implementing a complete X/Open compliant file system interface. In the cases of redundant interfaces, the SIO functions can simply be implemented as wrappers over the standard functions. However, these functions should be implemented in such a way as to ensure that all libraries written to this API can run properly.

Our two-level approach arises from the conflicting goals of some aspects of different extended interfaces. For example, in a discussion of the commonalities between IBM’s PIOFS and Intel’s PFS in February 1995, we identified little more than the basic UNIX functions in common. Largely this is because IBM had chosen to support the concept of dynamic partitioning and subfiles, while Intel supported a set of file modes to define the semantics of parallel access. Our two-level approach moves the implementation of the special character of these parallel file systems (Intel I/O modes or IBM subfiles) to high-level libraries and proposes a low-level interface capable of efficiently supporting both of these and other specialized parallel file system function sets. The approach follows CMU’s December 1994 suggestion, in that the new interfaces are low level, but are powerful for implementing high-level parallel I/O libraries.

^1 MPI-IO is an example of such a high-level interface.

The usage scenario is that I/O libraries can be easily and efficiently built on top of the interfaces provided by this API. Each vendor is free to implement whatever libraries they wish on top of these interfaces. Likely libraries include MPI-IO, a PIOFS subfile library, and a library which supports Intel’s I/O modes.^2 It is simpler to implement or share a library at this level than to implement the function in the vendor-specific file system itself. Also, third party vendors (or groups such as SIO) can produce libraries that could compile and run on another vendor’s machine. In addition, these interfaces could be a compiler target.

Code written to this low-level API is intended to be portable. By this we mean source compatibility. In particular, each implementation of this API is free to assign different bit lengths to most types and different bit values to all constants, except as noted. Because the size of fields is implementation dependent, the range of some variables may also vary. In some cases this may limit source compatibility, so we have tried to require comfortably large limits wherever possible.


1.1 Independent Messaging and Minimal Synchronization

One view of a parallel application is of a set of tasks, typically executing on different nodes, communicating among themselves, possibly via shared memory. There are a variety of abstractions, toolkits, and mechanisms for communicating from which a particular parallel application may choose. One principle of this low-level API is to avoid dependence on the application’s chosen method for communication. This means that a low-level parallel file system client implementation may not be aware of application-level messages and certainly cannot expect to use the same methods for communicating with its peer client agents. Of course, each client agent of the low-level parallel file system must be able to communicate with the parallel file system servers (if any). The method of this communication is implementation specific and will most likely be unavailable to the application programmer.

^2 We do not intend to prescribe the software structure of an implementation of PIOFS or PFS built with this API. Our expectation is that implementations will be efficient enough to allow libraries built entirely on the interfaces in this API to obtain high performance. For example, an application coded for an SIO-based Intel I/O-mode library should run efficiently on an IBM SP2 offering these interfaces. Of course, when this application runs on a Paragon, it is not required to use the I/O-mode library in favor of the native PFS interfaces.

Another guiding principle in the design of this API is to discourage unnecessary synchronization of the client applications or of the client agents of the parallel file system. To this end, this API is designed to admit efficient low-level parallel file system implementations which restrict internal communication to a single client and the parallel file system server(s) responsible for a particular file. That is, this API does not require that client agents of the parallel file system directly communicate. This means that a compliant parallel file system implementation need not provide coherent distributed shared memory, shared file pointer synchronization, or collective I/O barrier synchronization. As described below, distributed shared memory may be avoided with application-managed weakly consistent caches, and collective I/O barrier synchronization can be made implicit by requiring the application to distribute an opaque collective I/O handle defined by the parallel file system.


1.2 No Shared File Pointers

One of the original points of disagreement in the development of the API was support for shared file pointers. Some parallel file systems exploit shared file pointers extensively while others avoid this implicit synchronization as much as possible. The position of this API is similar to the latter: that shared file pointers can require extensive synchronization of the client agents of the parallel file system; that they implicitly synchronize the application’s tasks; and that they can easily lead to excessive synchronization, slowing the application. Further, we contend that if this level of application synchronization is valuable, it should be provided by the higher level parallel file system libraries which may have access to peer-to-peer messaging systems and can be customized to specific applications’ needs. For these reasons this API does not support shared file pointers; in fact, it does not support file pointers at all, requiring the offsets for all I/O operations to be explicitly provided.
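
The explicit-offset style described above is already familiar from the POSIX pread() call. The following is a minimal sketch and not part of this API: read_at and read_at_selfcheck are hypothetical helper names, and pread() merely stands in for whatever read function an implementation provides. Because every call names its own offset, tasks sharing a file need no agreement about a file pointer.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at an explicit
 * byte offset. No file pointer is consulted or updated, so concurrent
 * callers cannot disturb each other's position. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          offset + (off_t)done);
        if (n < 0)
            return -1;   /* error; errno is set by pread() */
        if (n == 0)
            break;       /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Small self-check: writes ten bytes to a temporary file and reads
 * four of them back from offset 3. Returns 0 on success. */
int read_at_selfcheck(void)
{
    char path[] = "/tmp/read_at_XXXXXX";
    char buf[4];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);        /* the open descriptor keeps the file alive */
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    ssize_t n = read_at(fd, buf, sizeof buf, 3);
    close(fd);
    assert(n <= (ssize_t)sizeof buf);
    return (n == 4 && memcmp(buf, "3456", 4) == 0) ? 0 : -1;
}
```

Two tasks calling read_at on the same descriptor with different offsets do not interact, which is the property the API relies on when it omits file pointers.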

1.3 Scatter-Gather Transfers

Batching transfers is a powerful strategy for improving performance. A parallel file system implementation can be expected to try to batch accesses to the disk, transfers between machine nodes, and buffer manipulations. Traditional UNIX read-write interfaces transfer contiguous file regions and contiguous memory regions, dramatically reducing batching opportunities for applications that manipulate large, non-contiguous data regions. Correspondingly, a principal extension for high-performance file systems is the compact representation of transfers of non-contiguous regions, commonly known as scatter-gather. In the core of this API proposal, the expressive power of scatter-gather is limited to a list of strided (vector) regions.^3
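
To make the notion of a list of strided regions concrete, the sketch below gathers such a list from one memory buffer into a contiguous one. It is an illustration under stated assumptions only: the strided_region structure, its field names, and gather_strided are hypothetical and are not the types defined later in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one strided (vector) region: `count`
 * blocks of `block_len` bytes, the first at `offset`, with successive
 * blocks `stride` bytes apart. Field names are illustrative only. */
typedef struct {
    size_t offset;
    size_t block_len;
    size_t stride;
    size_t count;
} strided_region;

/* Gather every block named by a list of strided regions from `src`
 * into the contiguous buffer `dst`. Returns the number of bytes
 * copied, or 0 if a block would fall outside `src` or `dst`. */
size_t gather_strided(void *dst, size_t dst_len,
                      const void *src, size_t src_len,
                      const strided_region *regions, size_t nregions)
{
    size_t out = 0;
    assert(dst != NULL && src != NULL);
    for (size_t r = 0; r < nregions; r++) {
        for (size_t i = 0; i < regions[r].count; i++) {
            size_t at = regions[r].offset + i * regions[r].stride;
            size_t len = regions[r].block_len;
            if (at > src_len || len > src_len - at || len > dst_len - out)
                return 0;
            memcpy((char *)dst + out, (const char *)src + at, len);
            out += len;
        }
    }
    return out;
}
```

A single descriptor with offset 1, block length 2, stride 4, and count 3 names bytes 1-2, 5-6, and 9-10 of the source, so three separate contiguous requests collapse into one compact description.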

1.4 Asynchronous I/O

The API provides interfaces for asynchronous reads and writes. Outstanding accesses can be polled or waited upon (either singly or as a list of accesses).

1.5 I/O Controls

This API allows applications to get and set file status data (such as file sizes), get and set performance-related information (such as file caching and layout), and perform various operations (such as cache consistency) via a general I/O control mechanism. Vendors can define their own control operations, allowing the API to be extended easily.

Some controls, notably data layout and capacity preallocation controls, may be performed much more efficiently as a group and/or at the time a file is created or opened. For this reason, multiple controls may be specified in the same operation, and the extended open interface in this API allows a set of controls to be executed when a file is opened. Because of the large amount of work that might be done by a set of controls, the API allows the failure of an I/O control to fail the overall open or control operation immediately, and allows implementations to declare that certain controls may not be issued as a part of the same operation.

^3 Beyond this proposal, some SIO researchers have shown an interest in nested lists of strided regions.


1.6 Client Caching

Because parallel files will experience concurrent read-write sharing, maintaining client cache consistency could become quite expensive. An implementation of this API may provide no client caching (for example, in some parallel systems the latency for fetching a file block from a server’s cache may be low enough to not warrant client file caches). It may also provide strong consistency using shared memory mechanisms. However, many parallel applications will synchronize concurrent sharing at a higher level and can explicitly determine when to propagate written data from their local caches and when to refresh stale data in their local caches. This API enables these applications to improve their client cache performance by requesting weak consistency on a particular open file and issuing the appropriate propagate and refresh controls. In the case of weak consistency, an implementation may divide the file address space into fixed sized consistency units (cache lines or blocks), each of which is either entirely present in a client cache or not present at all. Concurrent write sharing of a weakly consistent file within one consistency unit is not guaranteed to have reasonable semantics.
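
An application that partitions a weakly consistent file among writers can check whether two writers' byte ranges would land in a common consistency unit. The sketch below assumes the unit size is already known to the application; ranges_share_unit is a hypothetical helper and not a function of this API.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if byte ranges [off_a, off_a+len_a) and [off_b, off_b+len_b)
 * touch at least one common consistency unit of `unit` bytes, else 0.
 * `unit` is a plain parameter here because this sketch does not query
 * a real file system for its consistency unit size. */
int ranges_share_unit(uint64_t off_a, uint64_t len_a,
                      uint64_t off_b, uint64_t len_b, uint64_t unit)
{
    assert(unit > 0);
    if (len_a == 0 || len_b == 0)
        return 0;
    uint64_t first_a = off_a / unit, last_a = (off_a + len_a - 1) / unit;
    uint64_t first_b = off_b / unit, last_b = (off_b + len_b - 1) / unit;
    return first_a <= last_b && first_b <= last_a;
}
```

Note that two ranges can share a unit without sharing a single byte: with 4096-byte units, bytes 0-99 and bytes 4000-4049 are disjoint yet both fall in unit 0, which is exactly the case the paragraph above warns about.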

Note that this API makes no requirement that a low-level parallel file system implementation control or even detect unintentional read-write sharing, that is, read-write sharing by tasks that are parts of multiple uncoordinated parallel applications. In situations like this, which are common to many file systems, the atomicity of file creation can be used by higher level tools to provide simple advisory locks by using the existence of a file to signify a held lock.
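
The advisory lock idiom mentioned above can be sketched with the POSIX open() call, whose O_CREAT | O_EXCL combination fails if the file already exists. This is an illustration only: advisory_trylock and advisory_unlock are hypothetical names, and POSIX open() stands in for the file creation function of this API.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take the advisory lock named by `path`. Creation with
 * O_CREAT | O_EXCL is atomic, so exactly one caller succeeds.
 * Returns 0 if the lock was taken, 1 if it is already held,
 * and -1 on any other error. */
int advisory_trylock(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0) {
        close(fd);       /* the file's existence is the lock */
        return 0;
    }
    return errno == EEXIST ? 1 : -1;
}

/* Release the lock by removing the file. Returns 0 on success. */
int advisory_unlock(const char *path)
{
    assert(path != NULL);
    return unlink(path);
}
```

The lock is purely advisory: a task that never calls advisory_trylock is not prevented from touching the data, and a task that exits without unlocking leaves a stale lock file for a higher level tool to clean up.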




1.7 File Access Pattern Hints

Allowing an application to provide hints about file accesses can substantially improve performance, particularly when a large amount of data is read non-sequentially (but predictably), or when a large number of small files are read one at a time. There are at least two distinct approaches to giving hints: explicitly listing an ordered sequence of future accesses (such as “read block 5, then block 7”), and describing an access pattern with a single identifier (such as “random access,” “sequential access,” or “will not access”). Because it is not clear how to interpret a set of hints that intermingle these approaches, this API provides separate hint classes for each, does not specify how to interpret combinations containing both, and allows vendors to add new classes of hints as needed. To allow applications to provide information to the file system as early as possible, hints can be applied to open file descriptors or to files that have not yet been opened. In either case, hints apply only to the task that issued them, and not to other tasks.


1.8 Extensions to this API

In discussing earlier low-level API proposals, we found that there are some features that are almost universally agreed upon, and a few features that have significant constituencies but were not supported by all members of the group. We thus chose to define the low-level API as a basic API plus a set of optional extensions. An extension is a feature that:

• has significant research value;

• impacts performance, at least on some architectures; and

• is not trivial to implement correctly.

As a part of the basic API, implementations must provide mechanisms for allowing applications to determine which extensions are supported. Those mechanisms are detailed in Section 14.


1.9 Collective I/O

As mentioned in Section 1.3, batching is a powerful mechanism for improving performance. When multiple client nodes access one file at the same time, batching can again be useful, particularly when each client’s access is a complex pattern but the sum of all client accesses is a large contiguous access (e.g. the whole file). Accesses of this type are known as “collective I/O,” and this API includes an extension which provides collective I/O facilities. Current collective I/O mechanisms commonly exploit the implementation system’s task identifiers or task groups to name the members of a collective I/O. In this API we avoid dependence on the systems’ task naming mechanisms by dynamically defining an opaque identifier for a collective I/O that is distributed via the application’s communication system and presented to the parallel file system by each participant (client involved in the collective I/O). With this mechanism we enable at least three types of batching. First, the parallel file system implementation may choose to wait for all participants to join the collective I/O before doing any of the work. Second, the application can provide a hint describing the total work to be done by the collective I/O at the time the collective I/O is defined. Third, a collective I/O may be defined to have multiple iterations, avoiding multiple defining operations and enabling earlier collective hints.


1.10 Checkpoints and File Versioning

Many parallel applications want the ability to create checkpoints of their files. Others want the ability to efficiently create a series of versions of a file over time. Rather than directly supporting checkpoints or file versions, this API includes an extension which offers a generic “fast copy” operation. A fast copy might be implemented as duplication of a file’s metadata, with shared pointers to all data pages, each of which is marked copy-on-write. The tracking of copies is left up to the applications (or higher level parallel file system libraries).
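
The copy-on-write strategy suggested above can be modeled in memory in a few lines. The sketch below is a toy model and not an implementation of the fast copy extension: the page and toy_file structures, the toy_ function names, and the 8-byte page size are all assumptions made for illustration. A reference count on each page plays the role of the copy-on-write mark.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 8   /* tiny page size, chosen only for illustration */

/* A data page shared by one or more files; `refs` counts the sharers. */
typedef struct {
    int refs;
    char data[PAGE_BYTES];
} page;

/* Toy file metadata: just an array of page pointers. */
typedef struct {
    size_t npages;
    page **pages;
} toy_file;

/* Drop this file's reference to each page, freeing unshared pages. */
void toy_destroy(toy_file *f)
{
    if (!f)
        return;
    for (size_t i = 0; i < f->npages; i++)
        if (f->pages[i] && --f->pages[i]->refs == 0)
            free(f->pages[i]);
    free(f->pages);
    free(f);
}

/* Create a zero-filled file of `npages` pages; NULL on failure. */
toy_file *toy_create(size_t npages)
{
    toy_file *f = malloc(sizeof *f);
    if (!f)
        return NULL;
    f->npages = npages;
    f->pages = calloc(npages ? npages : 1, sizeof *f->pages);
    if (!f->pages) { free(f); return NULL; }
    for (size_t i = 0; i < npages; i++) {
        f->pages[i] = calloc(1, sizeof(page));
        if (!f->pages[i]) { toy_destroy(f); return NULL; }
        f->pages[i]->refs = 1;
    }
    return f;
}

/* Fast copy: duplicate only the metadata; every page becomes shared. */
toy_file *toy_fast_copy(const toy_file *src)
{
    assert(src != NULL);
    toy_file *f = malloc(sizeof *f);
    if (!f)
        return NULL;
    f->npages = src->npages;
    f->pages = calloc(src->npages ? src->npages : 1, sizeof *f->pages);
    if (!f->pages) { free(f); return NULL; }
    for (size_t i = 0; i < src->npages; i++) {
        f->pages[i] = src->pages[i];
        f->pages[i]->refs++;
    }
    return f;
}

/* Write one byte; a shared page is privately copied first. */
int toy_write(toy_file *f, size_t pageno, size_t off, char byte)
{
    if (pageno >= f->npages || off >= PAGE_BYTES)
        return -1;
    page *p = f->pages[pageno];
    if (p->refs > 1) {                 /* copy-on-write */
        page *priv = malloc(sizeof *priv);
        if (!priv)
            return -1;
        memcpy(priv->data, p->data, PAGE_BYTES);
        priv->refs = 1;
        p->refs--;
        f->pages[pageno] = p = priv;
    }
    p->data[off] = byte;
    return 0;
}
```

In this model the cost of toy_fast_copy is proportional to the number of pages rather than the number of bytes, and data is duplicated only for pages that are later written, which is what makes the operation attractive for checkpoints and versions.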
1.11 File Names and Access Protection

When these interfaces are merged with POSIX it is expected that POSIX conventions will be adopted for directories and access control. However, during SIO research, compliant implementations need not deal with these (important) issues.

This API does not define directories or directory operations. Files may be named in a flat name space, though implementations may choose to offer additional name space management. A directory structure is not viewed as essential to parallel file system performance and can be provided by vendor-defined extensions as needed.

Similarly, access control checking, permission specifications, and user and group identifiers are not specified by this API. Implementations which provide access control management are expected to do so via vendor-defined extensions.

1.12 File Labels

An important issue for higher level library systems and application systems is interoperability. To support interoperability without inserting header data into the file’s actual data, the low-level API offers a small amount of application controlled data called a label for each file. A file’s label is stored in its metadata.
2 Document Conventions


This document describes both the “basic” (or “core”) API and extensions to the basic API. The basic API is described in Sections 3 through 14, and the extensions are described in Sections 15 and 16. Some sections of this document refer to “this document,” which is meant to indicate the entirety of the basic API and the extensions described herein.

Implementations wishing to conform to this API must provide all of the types, definitions, and functions specified in the basic API, including those necessary to determine whether or not extensions are present.

2.1 Typesetting Conventions

Type definitions, function definitions, and constants (including control operation identifiers) are typeset in the bold font.

Function names are typeset in the bold font and are followed by parentheses, e.g. sio_open().

Variables, structure members, and function arguments are typeset in the italic font.


2.2 Definition of Terms

Throughout this document (except where explicitly noted) the phrase “file system” is used to indicate a file system which provides this API, and “implementation” is used to refer to the implementation of such a file system. Except where noted, the terms “application” and “higher-level library” are used interchangeably, and are meant to indicate the programs or libraries which are using this API to access parallel files.

Throughout this document, several words or phrases are used to indicate how given functionality must be used or implemented. For clarity, they are defined here:

“will,” “shall,” or “must”

When describing functionality provided by file system implementations, these terms indicate that conforming implementations have to implement the functionality as described.

When describing behavior of applications, these terms indicate the behavior of properly-written applications (i.e. applications behaving in other ways are considered buggy).

“should”

When describing functionality provided by file system implementations, this term suggests that an implementation provide the functionality in the manner described, but that doing so is not necessary for conformance.

When describing behavior of applications, this term indicates that the described behavior is the preferred behavior, but that other behavior may be correct.

“may”

When describing functionality provided by file system implementations, this term indicates that conforming implementations can implement functionality in the manner described, but doing so may not be suggested.

When describing behavior of applications, this term indicates that the described behavior is allowed, but not necessarily encouraged.

“undefined”

Undefined behavior is not specified by this standard, and is usually a result of a programming error or similar problem. Applications must avoid invoking undefined behavior. File system implementations may produce completely arbitrary results when undefined behavior is invoked, including producing random data, on disk or in memory buffers provided, or generating an exception.

“unspecified”

Unspecified behavior is not specified by this standard, but is usually the result of a correct programming practice. Behavior is left unspecified to give file system implementations freedom to implement functionality in different ways. Unspecified behavior must not have harmful permanent effects on the application or its data, and should be documented in individual implementations’ documentation. Portable applications must not rely on unspecified behavior causing the same results on multiple file system implementations.


2.3 How to Read this Document

It is recommended that you read Sections 6, 8, 9, 10, 11, 12, and 13 before Sections 3, 4, and 5. The reason for this is that Sections 3, 4, and 5 provide definitions which refer to functions explained in later sections.


3 The sio_fs.h Include File

File system implementations must provide a C include file named sio_fs.h which contains the data type definitions, constants, and function declarations and/or prototypes for all functions defined in this document. Implementations which provide extensions not defined in this document may require additional files be included to use those extensions. Implementations which do so must still define the extension support constants and extension identifiers (see Section 14.1) for the extensions in sio_fs.h.

Applications or higher-level libraries must include sio_fs.h in their source files before referencing any of the types, constants, or functions described in this API.


4 Data Types

This section defines the data types which are referenced in the basic API, and gives brief explanations of the rationale behind them. Types used exclusively by extensions are not defined here; they are defined with the extensions.

All of the types defined in this section must be provided by conforming implementations. Vendors may provide additional types with names of the form sio_vend_vendordefinedname_t, where vendordefinedname can be a name of the vendor’s choosing. All other type names beginning with sio_ and ending with _t are reserved for future use by this API.

Except where otherwise noted, the sizes of all non-structure data types are fixed on a per-implementation basis and those data types must be fully copyable (i.e. they must not contain any pointers to other objects).


4.1 File Descriptor

All file descriptors are described as being of type int, primarily for compatibility with other systems (including UNIX) which use ints as file descriptors. A task may have up to SIO_MAX_OPEN parallel files open at any given time.


4.2 File Name

All file names are character strings terminated by a byte with the value zero, and are described as being of type const char *. (They must never be modified by the system, and thus are const.) File names must not be longer than SIO_MAX_NAME_LEN characters, including the terminating zero byte.


4.3 Memory Address

Memory addresses are described as being of type void *. Each task must only access its own or a shared address space. Attempting to access memory for which the task does not have access permission produces undefined results.

361  4.4  sio_async_flags_t 

362  This  is  an  unsigned  integral  type  used  as  a  set  of  bits.  Currently  it  can 

363  contain  one  of  SIO_ASYNC_BLOCKING  or 

364  SIO_ASYNC_NONBLOCKING.  These  flags  indicate  whether  or  not 

365  sio_async_status_any()  will  block  waiting  for  an  asynchronous  I/O  to  com-

366  plete.  The  use  of  these  flags  is  described  in  Section  10.2.

367  4.5  sio_async_handle_t 

368  This  is  an  opaque  type  used  to  identify  asynchronous  I/Os. 

369  4.6  sio_async_status_t 

    typedef struct {
        sio_transfer_len_t count;
        sio_return_t       status;
    } sio_async_status_t;

374  This  structure  is  used  to  return  the  status  of  an  asynchronous  I/O.  For  a 

375  successful  operation,  count  is  set  to  the  number  of  bytes  transferred,  and 

376  status  is  set  to  SIO_SUCCESS.  For  an  unsuccessful  operation,  status  is 

377  set  to  a  value  which  indicates  the  nature  of  the  error,  and  count  is  set  to 

378  the  number  of  bytes  guaranteed  to  have  been  transferred  correctly  (see

379  Section  10.2).


380  4.7  sio_caching_mode_t

381  This  is  an  unsigned  integral  type  used  by  the  client  cache  control  interfaces, 

382  and  is  defined  in  Section  12. 


383  4.8  sio_control_t

    typedef struct {
        sio_control_flags_t flags;
        sio_control_op_t    op_code;
        void               *op_data;
        sio_return_t        result;
    } sio_control_t;


390  This  type  is  used  to  store  the  information  associated  with  a  control  operation 

391  (see  Section  13).  Control  operations  are  specified  by  providing  the  appro-

392  priate  operation  code  in  op_code,  an  indication  in  flags  of  what  to  do  if  the

393  control  cannot  be  performed,  and  a  pointer  to  a  data  buffer  (if  necessary)  in

394  op_data.

395  The  result  field  is  set  by  the  function  performing  the  control  operation  to 

396  indicate  success  or  failure. 


397  4.9  sio_control_flags_t

398  This  is  an  unsigned  integral  type  used  as  a  set  of  bits.  Cur- 

399  rently  it  can  contain  one  of  SIO_CONTROL_MANDATORY  or 

400  SIO_CONTROL_OPTIONAL.  These  flags  indicate  whether  failure  of 

401  this  control  operation  will  cause  the  entire  set  of  control  operations  to  fail, 

402  with  semantics  as  described  in  Section  8.1. 
403  4.10  sio_control_op_t

404  This  is  an  unsigned  integral  type  used  to  indicate  a  control  operation  code.

405  Control  operation  codes  which  are  part  of  the  basic  API  are  defined  in

406  Section  13.

407  4.11  sio_count_t

408  This  is  an  unsigned  integral  type  with  the  range  [0 ... SIO_MAX_COUNT].

409  It  is  used  to  represent  a  quantity  of  objects. 

410  4.12  sio_extension_id_t

411  This  is  an  unsigned  integral  type  used  to  contain  extension  identifiers.  See 

412  Section  14.1.2  for  more  details  about  its  use. 


413  4.13  sio_file_io_list_t

    typedef struct {
        sio_offset_t offset;
        sio_size_t   size;
        sio_size_t   stride;
        sio_count_t  element_cnt;
    } sio_file_io_list_t;

420  This  structure  is  used  to  describe  a  collection  of  regions  within  a  file  that 

421  is  involved  in  a  parallel  file  system  operation.  Its  purpose  is  to  encapsulate 

422  the  description  of  many  simple  transfers  into  one  larger  and  more  complex 

423  transfer  to  enable  the  file  system  to  be  more  efficient  in  the  execution  of 

424  the  total  transfer.  Each  sio_file_io_list_t  structure  describes  a  sequence  of

425  equally-sized  and  evenly-spaced  contiguous  byte  regions  within  a  file;  this  is

426  sometimes  called  a  “strided”  access  pattern.  Common  matrix  decompositions

427  can  be  described  with  such  data  structures. 

428  The  structure  describes  a  set  of  element_cnt  contiguous  regions,  each  of  size

429  size,  with  the  first  region  beginning  at  offset  offset  from  the  beginning  of  the 

430  file,  and  the  beginning  of  each  subsequent  region  starting  stride  bytes  after 

431  the  start  of  its  predecessor.  These  contiguous  byte  regions  may  overlap;  see 

432  Section  9  for  details. 
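The strided description above can be made concrete with a short sketch. The type definitions below are stand-ins (a conforming implementation would supply the real ones in <sio_fs.h>, and the integer widths chosen here are assumptions), and the helpers region_start() and matrix_column() are hypothetical names introduced only for illustration. The sketch computes where each region begins and shows how one column of a row-major matrix stored in a file could be described by a single list element.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; the widths are assumptions, not part of the API. */
typedef long long     sio_offset_t;  /* signed, per Section 4.23 */
typedef long long     sio_size_t;    /* signed, per Section 4.25 */
typedef unsigned long sio_count_t;

typedef struct {
    sio_offset_t offset;
    sio_size_t   size;
    sio_size_t   stride;
    sio_count_t  element_cnt;
} sio_file_io_list_t;

/* Hypothetical helper: file offset at which region i begins. */
static sio_offset_t region_start(const sio_file_io_list_t *l, sio_count_t i)
{
    assert(l != NULL && i < l->element_cnt);
    return l->offset + (sio_offset_t)i * l->stride;
}

/* Hypothetical helper: describe column `col` of a row-major matrix of
   `rows` x `cols` elements, each `elem` bytes long, stored from offset 0. */
static sio_file_io_list_t matrix_column(sio_count_t rows, sio_count_t cols,
                                        sio_count_t col, sio_size_t elem)
{
    sio_file_io_list_t l;

    assert(col < cols && elem > 0);
    l.offset      = (sio_offset_t)col * elem;   /* first element of the column */
    l.size        = elem;                       /* one matrix element per region */
    l.stride      = (sio_size_t)cols * elem;    /* one full row between regions */
    l.element_cnt = rows;                       /* one region per row */
    return l;
}
```

For a 4 x 8 matrix of 8-byte elements, column 2 becomes four 8-byte regions spaced 64 bytes apart, starting at byte 16.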


433  4.14  sio_hint_t


    typedef struct {
        sio_hint_flags_t    flag;
        sio_file_io_list_t *io_list;
        sio_count_t         list_len;
        void               *arg;
        sio_size_t          arg_len;
    } sio_hint_t;


441  This  structure  is  used  to  store  hint  information  (see  Section  11).  The  flag 

442  field  describes  the  access  patterns  being  hinted,  and  the  io_list  and  list_len

443  fields  describe  the  regions  of  the  file  to  which  the  hint  applies.  The  arg  and

444  arg_len  fields  contain  a  pointer  to  a  hint-specific  argument  and  the  (non-

445  negative)  length  of  the  argument,  respectively.  These  fields  allow  different

446  types  of  hints  to  require  different  types  of  arguments,  while  using  the  same 

447  hint  interfaces. 


448  4.15  sio_hint_class_t

449  This  is  an  unsigned  integral  type  which  contains  the  class  identifier 

450  of  hints  passed  with  the  sio_hint()  and  sio_hint_by_name()  functions. 

451  Each  class  of  hints  contains  one  or  more  hint  types  whose  interaction 

452  is  specified.  Interactions  between  hint  types  of  different  classes  are  un-

453  specified.  This  document  defines  the  SIO_HINT_CLASS_ORDERED

454  and  SIO_HINT_CLASS_UNORDERED  constants  to  describe  manda-

455  tory  hint  classes,  and  reserves  constants  whose  names  begin  with

456  SIO_HINT_CLASS_VEND_  for  use  by  vendors.  See  Section  11  for  more 

457  details  about  hints  and  hint  classes. 


458  4.16  sio_hint_flags_t


459  This  is  an  unsigned  integral  type  used  as  a  set  of  bits.  It  is  used  to  describe

460  the  hint  information  stored  in  a  sio_hint_t.  See  Section  11  for  a  list  of

461  possible  values  for  this  type  and  explanations  of  their  use. 


462  4.17  sio_label_t

    typedef struct {
        sio_size_t size;
        void      *data;
    } sio_label_t;

467  This  type  is  used  to  store  a  file  label,  which  can  contain  application- 

468  managed  descriptive  information  about  its  associated  file.  The  data  field 

469  points  to  a  memory  buffer  size  bytes  long.  The  SIO_CTL_GetLabel  and 

470  SIO_CTL_SetLabel  control  operations  use  this  structure  in  different  man- 

471  ners;  see  Section  13.9  for  more  information  about  this  structure’s  use. 


472  4.18  sio_layout_t

    typedef struct {
        sio_layout_flags_t     flags;
        sio_count_t            stripe_width;
        sio_size_t             stripe_depth;
        sio_layout_algorithm_t algorithm;
        void                  *algorithm_data;
    } sio_layout_t;


480  The  number  of  parallel  storage  devices  over  which  the  file’s  data  are  striped 

481  is  contained  in  the  stripe_width  field,  while  the  (non-negative)  number  of

482  contiguous  bytes  stored  on  each  device  (the  unit  of  striping)  is  contained

483  in  stripe_depth.  The  stripe_width  does  not  include  any  devices  containing

484  redundancy  information,  such  as  ECC  codes  or  duplicate  copies  of  the  data. 

485  The  algorithm  field  indicates  the  style  of  layout  used  for  the  file  to  provide 

486  guidance  in  the  interpretation  of  the  stripe_width  and  stripe_depth  fields.  The

487  algorithm_data  field  is  used  to  store  algorithm-specific  information  about  the

488  layout. 

489  The  flags  field  indicates  which  portions  of  the  sio_layout_t  structure  are

490  being  provided  to  the  system  or  should  be  filled  in  by  the  system  as  described 

491  in  Section  13.8. 


492  4.19  sio_layout_algorithm_t

493  This  is  an  unsigned  integral  type  whose  value  indicates  the  style  of 

494  layout  used  for  an  SIO  file.  The  layout  algorithm  describing  a  sim-

495  ple  round-robin  striping  across  all  storage  devices  used  for  a  file  is

496  SIO_LAYOUT_ALGORITHM_SIMPLE_STRIPING.  This  must  be 

497  defined,  though  not  necessarily  supported,  by  all  implementations.  Imple- 

498  mentations  may  choose  to  support  additional  layout  algorithms  that  describe 

499  layouts  in  more  detail  or  provide  for  more  complex  storage  system  architec-

500  tures.  The  algorithm_data  field  in  the  sio_layout_t  structure  can  be  used  to

501  store  additional  information  about  the  layout  algorithm. 

502  Layout  algorithm  names  beginning  with 

503  SIO_LAYOUT_ALGORITHM_VEND_  are  reserved  for  use  by  vendors. 

504  All  other  names  beginning  with  SIO_LAYOUT_ALGORITHM_  are  re-

505  served  for  future  use  by  this  API.


506  4.20  sio_layout_flags_t

507  This  is  an  unsigned  integral  type  used  as  a  set  of  bits.  It  may  contain 

508  zero  or  more  of  SIO_LAYOUT_WIDTH,  SIO_LAYOUT_DEPTH,  or 

509  SIO_LAYOUT_ALGORITHM,  bitwise  ORed  to  specify  which  fields  of  an

510  sio_layout_t  structure  are  to  be  returned  or  set.
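As a small illustration of how such a bit set might be composed and tested, consider the sketch below. The specification names the flags but does not give their values in this section, so the bit values here are assumptions, and layout_field_selected() is a hypothetical helper rather than part of the API.

```c
#include <assert.h>

/* Stand-in type and bit values; both are assumptions made for this sketch. */
typedef unsigned int sio_layout_flags_t;
#define SIO_LAYOUT_WIDTH     0x1u
#define SIO_LAYOUT_DEPTH     0x2u
#define SIO_LAYOUT_ALGORITHM 0x4u

/* Hypothetical helper: nonzero if `field` is selected in `flags`. */
static int layout_field_selected(sio_layout_flags_t flags,
                                 sio_layout_flags_t field)
{
    return (flags & field) != 0;
}
```

A caller interested only in the stripe width and depth would pass SIO_LAYOUT_WIDTH | SIO_LAYOUT_DEPTH, leaving the algorithm bit clear.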


511  4.21  sio_mem_io_list_t

    typedef struct {
        void       *addr;
        sio_size_t  size;
        sio_size_t  stride;
        sio_count_t element_cnt;
    } sio_mem_io_list_t;

518  This  type  is  similar  to  sio_file_io_list_t  except  that  it  describes  a  collec-

519  tion  of  regions  within  one  memory  space  that  is  involved  in  a  parallel  file

520  system  operation,  rather  than  a  collection  of  file  regions.  Its  purpose  is  to

521  encapsulate  the  description  of  many  simple  transfers  into  one  larger  and  more 

522  complex  transfer  in  order  to  enable  the  file  system  to  be  more  efficient  in  the 

523  execution  of  the  total  transfer.  Each  sio_mem_io_list_t  structure  describes

524  a  sequence  of  equally-sized  and  evenly-spaced  contiguous  byte  regions  within 

525  the  memory  space. 

526  The  structure  describes  a  set  of  element_cnt  contiguous  regions,  each  of  size

527  size,  with  the  first  region  beginning  at  address  addr,  and  the  beginning  of 

528  each  subsequent  region  starting  stride  bytes  after  the  start  of  its  predecessor. 

529  These  contiguous  byte  regions  may  overlap;  see  Section  9  for  details. 
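To show how a memory list element maps onto actual bytes, the sketch below gathers the described regions into one contiguous buffer. It is only an illustration of the addressing rule: the typedefs are stand-ins with assumed widths, and gather_regions() is a hypothetical helper, not a function defined by this API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in types; the widths are assumptions, not part of the API. */
typedef long          sio_size_t;   /* signed, per Section 4.25 */
typedef unsigned long sio_count_t;

typedef struct {
    void       *addr;
    sio_size_t  size;
    sio_size_t  stride;
    sio_count_t element_cnt;
} sio_mem_io_list_t;

/* Hypothetical helper: copy the regions described by `l` into the
   contiguous buffer `dst`, in order, and return the bytes copied.
   Assumes a non-negative size; the stride may be negative. */
static size_t gather_regions(const sio_mem_io_list_t *l, void *dst)
{
    const char *base = (const char *)l->addr;
    char *out = (char *)dst;
    sio_count_t i;

    assert(l->size >= 0);
    for (i = 0; i < l->element_cnt; i++) {
        /* Region i starts i * stride bytes after the first region. */
        memcpy(out, base + (ptrdiff_t)i * l->stride, (size_t)l->size);
        out += l->size;
    }
    return (size_t)(out - (char *)dst);
}
```

With addr pointing at "abcdefghijkl", size 2, stride 4, and element_cnt 3, the gathered bytes are "abefij".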


530  4.22  sio_mode_t

531  This  is  an  unsigned  integral  type  used  as  a  set  of  bits  to  specify  the  mode 

532  of  a  file  operation.  For  example,  the  mode  flags  SIO_MODE_READ  and 

533  SIO_MODE_WRITE  can  be  specified  together  or  separately  to  open  the

534  file  for  reading  and/or  writing,  or  to  indicate  what  operation  is  being  hinted. 

535  Other  flags  are  documented  in  Section  8.2. 

536  4.23  sio_offset_t

537  This  is  a  signed  integral  type  whose  absolute  value  is  in  the  range

538  [0 ... SIO_MAX_OFFSET].  This  type  is  signed  to  allow  an  offset  vari-

539  able  to  be  decremented  in  a  loop,  and  have  the  loop  terminate  when  the

540  variable  becomes  negative. 
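The loop idiom mentioned above can be sketched as follows. The typedef is a stand-in with an assumed width, and count_blocks_backwards() is a hypothetical function written only to demonstrate why a signed offset lets the loop terminate.

```c
#include <assert.h>

typedef long long sio_offset_t;  /* stand-in; a signed integral type */

/* Hypothetical helper: count the blocks visited when walking backwards
   from offset `last` towards the start of the file in steps of `block`. */
static unsigned long count_blocks_backwards(sio_offset_t last,
                                            sio_offset_t block)
{
    unsigned long visited = 0;
    sio_offset_t off;

    assert(block > 0);
    /* With an unsigned type, `off >= 0` would always be true and the loop
       would never end; the signed type lets it stop once off drops below 0. */
    for (off = last; off >= 0; off -= block)
        visited++;
    return visited;
}
```

Starting at offset 4096 with 1024-byte blocks, the loop visits offsets 4096, 3072, 2048, 1024, and 0 before the offset becomes negative.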

541  4.24  sio_return_t

542  This  is  an  unsigned  integral  type  used  by  functions  in  this  API  to  return  a

543  result  code.  The  constant  SIO_SUCCESS,  whose  value  must  be  0,  denotes

544  success. 

545  Other  values  indicate  specific  errors  which  have  been  encountered  in  pro- 

546  cessing  this  request  (the  enumeration  of  standard  error  codes  is  included  in 

547  Appendix  A).  Error  code  names  beginning  with  SIO_ERR_VEND_  may  be

548  used  by  vendors  for  vendor-specific  error  codes.  All  other  error  code  names

549  beginning  with  SIO_ERR_  are  reserved  for  future  use  by  this  API.  At  least

550  16384  error  codes  (including  0,  for  SIO_SUCCESS)  must  be  available  for 

551  use  by  this  API. 

Footnote  to  Section  4.23  (sio_offset_t):  We  do  not  take  advantage  of  the
defined  behavior  of  C,  which  allows  the  effect  of  negative  signed  numbers  to
be  achieved  by  using  large  unsigned  numbers  that  are  congruent  modulo  2^n.
The  value  2^63 - 1  is  a  sufficiently  large  offset  that  the  extra  factor  of  2
possible  by  using  unsigned  offsets  is  not  expected  to  be  important  before
machines  with  128  bit  word  sizes  become  widely  used  for  high  performance
computing.

Footnote  to  Section  4.24  (sio_return_t):  An  earlier  version  of  this  document
used  UNIX-style  returns,  where  0  indicated  success,  and  -1  indicated  failure,
with  specific  UNIX  error  codes  being  set  in  the  global  error  register.  This
was  deemed  inappropriate  for  two  reasons.  One  is  that  the  values  of  UNIX
error  numbers  vary  from  platform  to  platform,  as  does  the  specific  list  of
errors  available.  Another  more  serious  problem  is  that  it  is  difficult  for
multi-threaded  applications  to  express  different  errors  to  different  callers
using  a  single  global  error  register.  Some  systems,  such  as  pthreads,  provide
a  thread-specific  error  register  for  this  reason.  This  was  also  deemed
unacceptable,  because  it  would  require  the  parallel  file  system  to  be  aware
of  the  threading  mechanism.
552  4.25  sio_size_t

553  This  type  is  used  to  describe  sizes  of  file  and  memory  regions.  It  is  a  signed

554  integral  type  whose  absolute  value  is  in  the  range  [0 ... SIO_MAX_SIZE].

555  It  is  signed  to  allow  expression  of  reverse  strides  for  operations  such  as

556  sio_sg_read(). 

557  4.26  sio_transfer_len_t

558  This  is  an  unsigned  integral  type  in

559  the  range  [0 ... SIO_MAX_TRANSFER_LEN].  It  is  used  to  count  the

560  total  number  of  bytes  transferred  in  I/O  operations.  This  type  differs  from 

561  sio_size_t  in  that  a  single  I/O  operation  may  transfer  many  buffers  whose 

562  length  is  represented  by  sio_size_t,  hence  sio_transfer_len_t  is  needed. 

563  5  Range  Constants

564  This  section  describes  the  constants  used  in  this  basic  API  to  specify  the 

565  ranges  of  data  types.  These  constants  are  implementation-specific.  However, 

566  for  each  of  them,  both  a  minimum  value  and  a  recommended  value  are  given. 


567  5.1  SIO_MAX_ASYNC_OUTSTANDING 

568  This  constant  specifies  the  maximum  number  of  outstanding  asynchronous 

569  I/O  requests  that  one  task  can  have  at  one  time.  The  minimum  value  is  1, 

570  and  the  recommended  value  is  512. 


577  5.2  SIO_MAX_COUNT 

572  This  constant  specifies  the  maximum  number  of  items  that  can  be  defined 

573  by  an  sio_count_t.  The  minimum  value  is  2^®  —  1,  and  the  recommended 

574  value  is  2^^  —  1. 


575  5.3  SIO_MAX_LABEL_LEN 

576  This  constant  specifies  the  maximum  length  of  a  file  label.  The  minimum 

577  value  is  SIO_MAX_NAME_LEN  (whose  minimum  value  is  256  bytes). 

578  The  recommended  value  is  the  maximum  of  1024  and  the  implementation’s 

579  value  of  SIO_MAX_NAME_LEN. 


580  5.4  SIO_MAX_NAME_LEN

581  This  constant  specifies  the  maximum  length  of  a  file  name.  The  minimum

582  value  is  256,  and  the  recommended  value  is  1024.


583  5.5  SIO_MAX_OFFSET

584  This  constant  specifies  the  maximum  value  for  a  file  offset.  The  minimum 

585  value  is  2®^  —  1,  and  the  recommended  value  is  2®3  -  1. 


585  5.6  SIO_MAX_OPEN 

587  This  constant  specifies  the  maximum  number  of  open  files  that  a  task  can 

588  have  at  one  time.  The  minimum  value  is  256,  and  the  recommended  value  is 

589  512.  Note  that  a  task  may  still  fail  to  open  a  file  before  reaching  this  number 

590  because  of  system  resource  exhaustion. 


591  5.7  SIO_MAX_SIZE 

592  This  constant  specifies  the  maximum  size  in  bytes  of  a  variety  of  objects 

593  in  the  API.  The  minimum  value  is  2^^  —  1,  and  the  recommended  value  is 

594  2®^  —  1. 


595  5.8  SIO_MAX_TRANSFER_LEN

596  This  constant  specifies  the  maximum  number  of  bytes  that  can  be  transferred 

597  by  a  single  I/O  operation.  The  minimum  value  is  SIO_MAX_SIZE,  and  the 

598  recommended  value  is  2®^  —  1.  Since  several  components  of  a  scatter-gather 

599  I/O  list  can  be  transferred  at  once,  SIO_MAX_TRANSFER_LEN  must

600  be  greater  than  or  equal  to  SIO_MAX_SIZE.


601  6  File  Attributes 


602  This  section  describes  the  attributes  associated  with  an  SIO  file.  The  file 

603  attributes  are  unique  to  each  SIO  file  and  visible  to  all  tasks  opening  the 

604  file.  These  attributes  include  the  logical,  physical,  and  preallocation  sizes  of 
606  the  file,  file  label,  and  file  layout  information.  Extended  controls  may  define 
606  additional  file  attributes. 


607  6.1  File  Sizes 


608  The  logical  size  of  an  SIO  file  is  the  number  of  bytes  from  the  begin- 

609  ning  of  the  file  (offset  zero)  to  the  end  of  the  file  (the  largest  offset  from 

610  which  data  can  be  read  successfully).  The  file  may  contain  regions  which 

611  have  not  yet  been  written  (referred  to  as  “holes”),  which  are  read  as  ze- 

612  ros.  The  logical  size  can  be  increased  or  decreased  with  the  control  oper- 

613  ation  SIO_CTL_SetSize  (see  Section  13).  Decreasing  the  logical  size  via 

614  SIO_CTL_SetSize  corresponds  to  truncating  the  file,  and  increasing  it  cre- 
616  ates  a  hole  extending  from  the  previous  end  of  file  to  the  new  end  of  file.  A 

616  file’s  logical  size  can  also  be  increased  by  writing  data  past  the  current  end 

617  of  file. 

618  The  physical  size  of  an  SIO  file  is  the  amount  of  physical  storage  in  bytes 

619  allocated  to  store  the  file  data  (excluding  metadata).  It  may  be  different 

620  from  the  logical  size  of  the  file  because  of  fixed  size  allocation  blocks  and  be- 

621  cause  each  implementation  has  the  freedom  to  store  data  in  any  appropriate 

622  manner,  including  not  storing  the  content  of  holes  and  the  use  of  data  com- 

623  pression  techniques.  The  user  has  no  direct  control  over  the  file’s  physical 

624  size. 

626  The  preallocation  size  of  an  open  SIO  file  is  the  minimum  logical  size 

626  to  which  the  file  system  guarantees  the  file  may  grow  without  running  out 

627  of  space.  When  a  file  is  opened  (created),  its  preallocation  size  defaults 

628  to  its  physical  size  (zero)  unless  a  SIO_CTL_SetPreallocation  control 

629  operation  (see  Section  13)  is  specified  in  the  sio_open()  call.  Prealloca-

630  tion  size  is  not  affected  by  any  operation  defined  by  this  API  other  than

631  the  SIO_CTL_SetPreallocation  control  operation  and  sio_close().


632  6.2  File  Label 

633  The  file  label  of  an  SIO  file  is  a  part  of  the  file’s  metadata  that  is  acces- 

634  sible  to  the  user  for  storing  descriptive  information  about  the  file  without 

635  keeping  a  header  in  the  file  itself.  Labels  are  intended  to  support  interop- 

636  erability  by  associating  information  about  a  file’s  representation  (including 

637  file  type,  version,  writing  application,  etc.)  with  the  file  itself.  Labels  are  not

638  necessarily  the  same  length  in  all  implementations,  but  must  always  be  long 

639  enough  to  record  a  maximum  length  file  name  for  that  implementation.  This 

640  allows  representation  information  too  large  to  fit  in  a  file  label  to  be  stored 

641  in  a  separate  file  named  in  the  file  label.  The  size  of  a  label  is  given  in  the 

642  sio_label_t  containing  the  label.  This  size  is  at  least  as  large  as  an  im- 

643  plementation’s  longest  name  which  must  be  at  least  256  bytes.  The  maxi- 

644  mum  size  of  a  label  in  any  specific  implementation  is  given  by  the  constant 

645  SIO_MAX_LABEL_LEN  and  is  recommended  to  be  at  least  1024

646  bytes. 

647  6.3  File  Layout 

648  The  file  layout  of  an  SIO  file  expresses  the  placement  of  the  file  bytes  on 

649  the  parallel  storage  devices.  Some  implementations  may  allow  the  user  to 

650  specify  the  file  layout  when  the  file  is  created  with  the  SIO_CTL_SetLayout 

651  control  operation.  Other  implementations  may  allow  the  user  to  query  the 

652  file  layout  parameters  with  the  SIO_CTL_GetLayout  control  operation, 

653  but  not  to  set  the  layout.  Still  others  may  choose  not  to  reveal  anything 

654  about  the  underlying  file  layout  and  will  support  neither  of  the  layout  control 

655  operations. 

656  A  given  file  layout  consists  of  the  number  of  parallel  storage  devices  over 

657  which  the  file  data  are  striped,  the  number  of  contiguous  bytes  constituting 

658  each  striping  unit,  and  the  algorithm  which  specifies  the  striping  pattern  of 

659  the  striping  units.  For  example,  a  simple  striping  pattern  on  four  storage 
660  devices  using  a  striping  unit  of  1024  bytes  would  look  like  the  following  (the

661  starting  byte  number  of  each  striping  unit  is  shown): 


    Storage    Storage    Storage    Storage
    Unit 0     Unit 1     Unit 2     Unit 3

         0       1024       2048       3072
      4096       5120       6144       7168
      8192       9216      10240      11264
     12288      13312      14336      15360
     16384      17408      18432      19456
     20480      21504      22528      23552
     24576      25600      26624      27648
     28672      29696      30720      31744
         .          .          .          .
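The round-robin pattern in the table can be expressed as a small calculation. The sketch below is one way to map a file offset to a storage unit under simple striping; it assumes that each device stores its striping units contiguously, which this document does not require. The typedefs are stand-ins, and stripe_location_t and locate_byte() are hypothetical names that are not part of the API.

```c
#include <assert.h>

/* Stand-in types; the widths are assumptions, not part of the API. */
typedef long long     sio_offset_t;
typedef long long     sio_size_t;
typedef unsigned long sio_count_t;

typedef struct {
    sio_count_t  device;         /* index of the storage unit */
    sio_offset_t device_offset;  /* byte offset within that unit's share */
} stripe_location_t;             /* hypothetical type for this sketch */

/* Hypothetical helper: locate a file byte under simple round-robin
   striping over `stripe_width` devices with `stripe_depth`-byte units. */
static stripe_location_t locate_byte(sio_offset_t file_offset,
                                     sio_count_t stripe_width,
                                     sio_size_t stripe_depth)
{
    stripe_location_t loc;
    sio_offset_t unit;  /* index of the striping unit holding the byte */

    assert(file_offset >= 0 && stripe_width > 0 && stripe_depth > 0);
    unit = file_offset / stripe_depth;
    loc.device = (sio_count_t)(unit % (sio_offset_t)stripe_width);
    loc.device_offset = (unit / (sio_offset_t)stripe_width) * stripe_depth
                        + file_offset % stripe_depth;
    return loc;
}
```

With four devices and a 1024-byte striping unit, file byte 5120 falls on storage unit 1 and file byte 16384 wraps back to storage unit 0, matching the table.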

663  Note  to  implementor.  The  underlying  implementation  may  employ  advanced 

664  redundancy  encodings  or  dynamic  data  representation  (compressed  and  un- 

665  compressed  or  mirrored  and  parity  protected).  In  cases  like  these,  these 

666  layout  parameters  may  be  insufficient.  In  these  cases  the  width  of  a  stripe 

667  should  be  interpreted  as  the  parallelism  of  accesses  of  at  most  an  aligned 

668  striping  unit. 

669  7  Error  Reporting

670  To  make  it  easier  for  applications  to  deal  with  SIO  error  codes,  the  function 

671  sio_error_string()  is  provided.  This  function  takes  a  sio_return_t  value 

672  and  returns  a  const  char  *.  The  sio_error_string()  function  maps  error  codes

673  to  meaningful  error  strings.  When  passed  an  error  code  that  is  not  defined

674  by  the  implementation,  sio_error_string()  must  return  a  string  indicating

675  the  error  number  and  noting  that  the  error  code  is  unrecognized.


676  7.1  sio_error_string

677  Purpose 

678  Translate  a  sio_return_t  into  a  string. 

679  Syntax 

    #include <sio_fs.h>

    const char *sio_error_string(sio_return_t Result);

682  Parameters 

Result  The  return  code  to  translate. 

Description 

This  function  translates  a  return  code  to  a  string.  The  string  pointed 
to  must  not  be  modified  by  the  program,  and  may  be  overwritten  by 
subsequent  calls  to  sio_error_string().  If  the  implementation  supports 
NLS  (the  suite  of  internationalization  functions  mandated  by  X/Open
XPG  4.2),  the  contents  of  the  returned  error  message  string  should  be 
determined  by  the  setting  of  the  LC_MESSAGES  category  in  the 
locale. 
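One possible shape for such a function is sketched below. This is not the implementation of any particular vendor: the function is deliberately named sketch_error_string() to mark it as hypothetical, the numeric value given to SIO_ERR_FILE_NOT_FOUND and the message texts are invented, and locale handling is omitted. It does illustrate two points from the description: unrecognized codes yield a string that includes the error number, and the returned storage may be overwritten by a later call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in definitions; the error value below is an assumption. */
typedef unsigned int sio_return_t;
#define SIO_SUCCESS            0u
#define SIO_ERR_FILE_NOT_FOUND 1u

/* Hypothetical sketch of an error-string function. */
static const char *sketch_error_string(sio_return_t result)
{
    /* Reused across calls, so a later call may overwrite the text. */
    static char unknown[64];

    switch (result) {
    case SIO_SUCCESS:
        return "operation succeeded";
    case SIO_ERR_FILE_NOT_FOUND:
        return "file not found";
    default:
        snprintf(unknown, sizeof unknown,
                 "unrecognized error code %u", result);
        return unknown;
    }
}
```

Because of the static buffer, this sketch is not safe for concurrent callers; an application that needs to keep a message would copy it before calling the function again.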

692  8  Basic  Operations

693  This  section  defines  the  basic  operations  that  can  be  performed  on  parallel 

694  files.  Interfaces  are  provided  to  open  and  close  parallel  files,  to  remove  files 

695  from  a  parallel  file  system,  and  to  perform  control  operations  on  parallel  files. 

696  This  section  defines  some  operations  that  appear  to  be  similar  to  functions 

697  already  supported  in  the  POSIX  standard.  These  operations  exist  so  that 

698  implementations  of  this  interface  can  be  written  without  having  to  imple- 

699  ment  the  entire  POSIX  interface.  Implementations  that  do  support  complete 

700  POSIX  interfaces  must  still  support  the  functions  in  this  section,  although 

701  their  implementation  may  use  the  POSIX  functions. 

702  Three  of  the  functions  defined  in  this  section,  sio_open(),  sio_control(), 

703  and  sio_test(),  allow  the  application  to  specify  a  set  of  controls  to  be  applied 

704  to  a  file.  Because  sio_control()  provides  the  simplest  introduction  to  the 
706  use  of  controls,  it  is  described  first. 

8.1  sio_control

Purpose 

Perform  a  set  of  control  operations  on  a  given  file. 

Syntax 

    #include <sio_fs.h>

    sio_return_t sio_control(int FileDescriptor, sio_control_t *Ops,
                             sio_count_t OpCount);


Parameters 

FileDescriptor  The  file  descriptor  of  the  open  parallel  file  on  which  to 
perform  the  control  operations. 

Ops  An  array  of  control  operations  to  be  performed. 

OpCount  The  number  of  control  operations  in  the  array  referenced  by 
Ops. 

Description 

This  function  performs  the  set  of  control  operations  specified  by  the
Ops  argument  on  the  open  file  specified  by  the  FileDescriptor  argument.
Each  control  operation  is  either  mandatory  or  optional,  depending
on  the  bits  set  in  its  flags  field.  If  any  of  the  mandatory
operations  would  fail,  the  sio_control()  operation  fails  and  returns
SIO_ERR_CONTROL_FAILED.  In  contrast,  the  failure  of  an  optional
control  does  not  cause  sio_control()  to  fail.  The  status  of  the
individual  controls  can  be  checked  after  sio_control()  returns,  via  the
result  field  in  the  sio_control_t  structures.

The  application  must  not  assume  any  ordering  on  the  execution  of  the 
controls  in  Ops;  the  implementation  is  free  to  examine  and/or  execute 
the  Ops  in  any  order.  Those  control  operations  that  succeed  may  take 
effect  in  any  order. 

If  the  sio_control()  operation  succeeds,  then  all  of  the  mandatory
controls  take  effect  and  have  their  result  codes  set  to  SIO_SUCCESS.
With  regard  to  the  optional  controls,  one  of  two  situations  can  occur:

•  all  of  the  optional  controls  take  effect  and  have  their  result  codes
set  to  SIO_SUCCESS;  or 

•  at  least  one  of  the  optional  controls  fails  and  has  its  result  code 
set  to  a  control-specific  error  value.  The  remainder  of  the  optional 
controls  may  individually  1)  fail  and  have  their  result  code  set  to  a 
control-specific  error  value,  2)  take  effect  and  have  their  result  code 
set  to  SIO_SUCCESS,  3)  not  be  attempted  and  have  their  result 
code  set  to  SIO_ERR_CONTROL_NOT_ATTEMPTED. 

If  the  sio_control()  operation  fails  for  any  reason,  then  all  of  the
control  operations  in  Ops  are  annulled,  that  is,  they  have  no  permanent
effect  on  the  file  system.  If  sio_control()  fails,  none  of
the  controls  will  have  their  result  field  set  to  SIO_SUCCESS.  In
this  case,  the  implementation  may  set  the  result  field  of  a  particular
control  to  a  control-specific  error  code  if  that  control  would
have  failed  or  if  the  control  caused  the  sio_control()  to  fail,  or  to
SIO_ERR_CONTROL_WOULD_HAVE_SUCCEEDED  if  that
control  would  have  succeeded  had  the  sio_control()  operation  not
failed,  or  to  SIO_ERR_CONTROL_NOT_ATTEMPTED  if  the
sio_control()  failed  before  the  implementation  checked  whether  or  not
that  control  would  have  succeeded.

Section  13  defines  the  control  operations  included  in  the  basic  API. 
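The mandatory and optional rules above can be modelled in a few lines. The sketch below is not an implementation of sio_control(): it applies no controls to any file, all numeric constant values are assumptions, SIO_ERR_EXAMPLE_UNSUPPORTED is a hypothetical control-specific error, and the rule in would_succeed() is an arbitrary stand-in for a real file system's decision. It only shows how the result fields and the return value could be set consistently with the description.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types and values; every number here is an assumption. */
typedef unsigned int  sio_return_t;
typedef unsigned int  sio_control_flags_t;
typedef unsigned int  sio_control_op_t;
typedef unsigned long sio_count_t;

#define SIO_SUCCESS                          0u
#define SIO_ERR_CONTROL_FAILED               1u
#define SIO_ERR_CONTROL_WOULD_HAVE_SUCCEEDED 2u
#define SIO_ERR_EXAMPLE_UNSUPPORTED          3u  /* hypothetical error */
#define SIO_CONTROL_MANDATORY                0x1u
#define SIO_CONTROL_OPTIONAL                 0x2u

typedef struct {
    sio_control_flags_t flags;
    sio_control_op_t    op_code;
    void               *op_data;
    sio_return_t        result;
} sio_control_t;

/* Arbitrary stand-in rule: op codes of 100 or more are unsupported. */
static int would_succeed(const sio_control_t *op)
{
    return op->op_code < 100u;
}

/* Hypothetical model of the result-setting rules of sio_control(). */
static sio_return_t model_control(sio_control_t *ops, sio_count_t n)
{
    sio_count_t i;
    int mandatory_failed = 0;

    assert(ops != NULL || n == 0);
    for (i = 0; i < n; i++)
        if ((ops[i].flags & SIO_CONTROL_MANDATORY) && !would_succeed(&ops[i]))
            mandatory_failed = 1;

    for (i = 0; i < n; i++) {
        if (!would_succeed(&ops[i]))
            ops[i].result = SIO_ERR_EXAMPLE_UNSUPPORTED;
        else if (mandatory_failed)  /* annulled: never SIO_SUCCESS */
            ops[i].result = SIO_ERR_CONTROL_WOULD_HAVE_SUCCEEDED;
        else
            ops[i].result = SIO_SUCCESS;
    }
    return mandatory_failed ? SIO_ERR_CONTROL_FAILED : SIO_SUCCESS;
}
```

In this model a failing optional control leaves the overall call successful, while a failing mandatory control causes every control that would otherwise have succeeded to be reported as SIO_ERR_CONTROL_WOULD_HAVE_SUCCEEDED.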


757  Return  Values

SIO_SUCCESS

All  mandatory  control  operations  succeeded.

SIO_ERR_CONTROL_FAILED

At  least  one  of  the  mandatory  control  operations  failed.

SIO_ERR_CONTROLS_CLASH

Some  of  the  mandatory  control  operations  are  incompatible  with
each  other  and  cannot  be  performed  together  by  this  implementation.
If  a  control  operation  fails  with  this  error,  then  at  least  two
of  the  individual  control  operations  must  also  have  their  result
fields  set  to  SIO_ERR_CONTROLS_CLASH.

SIO_ERR_INVALID_DESCRIPTOR

The  FileDescriptor  parameter  is  not  a  valid  file  descriptor.

8.2  sio_open

Purpose 

Open  a  file  for  reading  and/or  writing. 

Syntax 

    #include <sio_fs.h>

    sio_return_t sio_open(int *FileDescriptorPtr, const char *Name,
                          sio_mode_t Mode,
                          sio_control_t *ControlOps,
                          sio_count_t ControlOpCount);


Parameters 

FileDescriptorPtr  On  success,  this  will  contain  the  file  descriptor  of  the 
newly  opened  file. 

Name  The  name  of  the  file  to  open.  The  name  must  be  at  most 
SIO_MAX_NAME_LEN  characters  in  length. 

Mode  The  mode  used  to  open  the  file.  Must  include  at  least  one 
of  SIO_MODE_READ  and  SIO_MODE_WRITE,  or  both 
ORed  together.  May  also  include  SIO_MODE_CREATE.

ControlOps  An  array  of  control  operations  to  be  performed  on  the  file 
during  the  open. 

ControlOpCount  The  number  of  operations  in  the  array  specified  by 
ControlOps. 

Description 

This  function  takes  a  logical  file  name,  and  produces  a  file  descriptor
which  supports  reading  and/or  writing,  depending  on  the
value  of  Mode.  If  the  named  file  does  not  exist  and  Mode  has  the
SIO_MODE_CREATE  bit  set,  then  the  file  will  be  created;  if
the  bit  is  not  set  then  SIO_ERR_FILE_NOT_FOUND  will  be  returned.
If  SIO_MODE_CREATE  is  set  and  the  file  already  exists,
SIO_ERR_ALREADY_EXISTS  will  be  returned.

799  As  part  of  the  operation  of  opening  the  file,  sio_open()  performs  the

800  control  operations  described  by  ControlOps  and  ControlOpCount.  The

801  control  operations  have  the  same  meaning  and  are  treated  in  the  same 

802  way  as  in  the  sio_control()  function. 

803  If  the  sio_open()  operation  fails  for  any  reason,  then  all  of  the  control 

804  operations  are  annulled  and  have  their  result  codes  set  in  the  same  way 

805  sio_control()  sets  the  result  codes  when  it  fails. 

806  Note  that  the  semantics  of  sio_open()  do  not  require  any  permission  or 

807  security  checks.  Implementations  not  embedded  in  a  POSIX  file  system 

808  that  wish  to  provide  file  permissions  can  check  those  permissions  on 

809  open  and  can  allow  those  permissions  to  be  set  via  implementation- 

810  specific  control  operations. 

811  Return  Codes 

SIO_SUCCESS 

The  open  succeeded. 

SIO_ERR_ALREADY_EXISTS 

SIO_MODE_CREATE  was  specified  and  the  file  already  exists. 

SIO_ERR_CONTROL_FAILED 

At  least  one  of  the  mandatory  control  operations  would  have 
failed. 

SIO_ERR_CONTROLS_CLASH 

Some of the mandatory control operations specified are incompatible
with each other and cannot be performed together by this
implementation.

SIO_ERR_FILE_NOT_FOUND

The file did not exist and SIO_MODE_CREATE was not specified.

SIO_ERR_INVALID_FILENAME

The  Name  parameter  is  not  a  legal  file  name. 

SIO_ERR_IO_FAILED 

A  physical  I/O  error  caused  the  open  to  fail. 


SIO_ERR_MAX_OPEN_EXCEEDED

Opening  the  file  would  result  in  the  task  having  more  than 
SIO_MAX_OPEN  open  file  descriptors. 
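
The create/exists rules above can be sketched as a small client-side helper that opens a file and creates it only when it is missing. This is a minimal sketch, not a reference implementation: the type definitions, the numeric constant values, the limits, and the in-memory sio_open() stub are hypothetical stand-ins so the example compiles on its own. Only the names sio_open, SIO_MODE_* and SIO_ERR_* are taken from this proposal, and the stub models nothing but the existence rules in the Description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for declarations that <sio_fs.h> would provide.
 * All numeric values below are assumptions made for this sketch only. */
typedef int sio_return_t;
typedef unsigned sio_mode_t;
typedef unsigned sio_count_t;
typedef struct { int unused; } sio_control_t;

enum { SIO_SUCCESS = 0, SIO_ERR_ALREADY_EXISTS, SIO_ERR_FILE_NOT_FOUND,
       SIO_ERR_INVALID_FILENAME, SIO_ERR_IO_FAILED, SIO_ERR_MAX_OPEN_EXCEEDED };
enum { SIO_MODE_READ = 1, SIO_MODE_WRITE = 2, SIO_MODE_CREATE = 4 };
enum { SIO_MAX_NAME_LEN = 31, SIO_MAX_OPEN = 4, MOCK_MAX_FILES = 8 };

/* Toy in-memory name table standing in for a real parallel file system. */
static char mock_names[MOCK_MAX_FILES][SIO_MAX_NAME_LEN + 1];
static int mock_file_count = 0;
static int mock_open_count = 0;

static int mock_lookup(const char *name) {
    for (int i = 0; i < mock_file_count; i++)
        if (strcmp(mock_names[i], name) == 0) return i;
    return -1;
}

/* Stub that models only the create/exists rules of sio_open(). */
sio_return_t sio_open(int *FileDescriptorPtr, const char *Name, sio_mode_t Mode,
                      sio_control_t *ControlOps, sio_count_t ControlOpCount) {
    (void)ControlOps; (void)ControlOpCount;    /* controls are not modelled */
    assert(Mode & (SIO_MODE_READ | SIO_MODE_WRITE));
    if (Name == NULL || Name[0] == '\0' ||
        strlen(Name) > (size_t)SIO_MAX_NAME_LEN)
        return SIO_ERR_INVALID_FILENAME;
    int exists = mock_lookup(Name) >= 0;
    if (exists && (Mode & SIO_MODE_CREATE)) return SIO_ERR_ALREADY_EXISTS;
    if (!exists && !(Mode & SIO_MODE_CREATE)) return SIO_ERR_FILE_NOT_FOUND;
    if (mock_open_count >= SIO_MAX_OPEN) return SIO_ERR_MAX_OPEN_EXCEEDED;
    if (!exists) {
        if (mock_file_count >= MOCK_MAX_FILES) return SIO_ERR_IO_FAILED;
        strcpy(mock_names[mock_file_count++], Name);   /* create the file */
    }
    *FileDescriptorPtr = mock_open_count++;
    return SIO_SUCCESS;
}

/* Client-side helper: open `name`, creating it only if it does not exist. */
sio_return_t open_or_create(int *fd, const char *name, sio_mode_t rw_mode) {
    sio_return_t rc = sio_open(fd, name, rw_mode, NULL, 0);
    if (rc == SIO_ERR_FILE_NOT_FOUND)
        rc = sio_open(fd, name, rw_mode | SIO_MODE_CREATE, NULL, 0);
    return rc;
}
```

Because SIO_MODE_CREATE fails on an existing file, the helper tries a plain open first and adds the create bit only after SIO_ERR_FILE_NOT_FOUND. On a real system another task could create the file between the two calls, in which case the second call would return SIO_ERR_ALREADY_EXISTS and the caller may simply retry.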

8.3  sio_close

Purpose 

Close  a  previously  opened  file. 

Syntax 

#include <sio_fs.h>

sio_return_t sio_close(int FileDescriptor);

Parameters 

FileDescriptor  The  file  descriptor  of  the  open  parallel  file  to  close. 

Description 

This  function  closes  an  open  file.  All  resources  associated  with  having 
the  file  open  will  be  deallocated.  Cached  pending  writes  are  made 
visible  to  other  nodes  before  sio_close()  returns  (see  Section  12  for 
details).  The  results  of  any  asynchronous  I/Os  in  progress  at  the  time 
sio_close()  is  called  are  unspecified,  and  the  handles  for  those  I/Os 
may be invalidated by the system. Applications can ensure that all
asynchronous I/Os are complete by calling sio_async_status_any()
before calling sio_close() (see Section 10.2). Pre-allocated space
that is not needed by the physical file associated with the open file
may be released.

Note  to  implementors:  Implementations  should  close  all  of  a  task’s 
open  parallel  file  descriptors  when  the  task  terminates. 

Return  Codes 

SIO_SUCCESS 

The  close  succeeded. 

SIO_ERR_INVALID_DESCRIPTOR 

The FileDescriptor parameter does not refer to a valid file descriptor
previously returned by sio_open().

SIO_ERR_IO_FAILED

A physical I/O error caused the close to fail.

8.4  sio_unlink

Purpose 

Remove  a  file  from  the  parallel  file  system. 

Syntax 

#include <sio_fs.h>

sio_return_t  sio_unlink(const  char  *Name); 

Parameters 

Name  The  name  of  the  file  to  remove. 

Description 

This function removes a file from the parallel file system, deallocating
any space that was allocated for the file. The semantics of unlinking an
open file are implementation-specific; possibilities include (but are not
limited to) allowing tasks which have this file open to continue to use
their open file descriptors, allowing subsequent I/O operations on the
file to fail, and allowing sio_unlink() itself to fail if the file is open.

Return  Codes 

SIO_SUCCESS 

The  unlink  succeeded. 

SIO_ERR_FILE_NOT_FOUND

The file did not exist.

SIO_ERR_FILE_OPEN

The file Name is open and the implementation does not allow open
files to be unlinked.

SIO_ERR_INVALID_FILENAME

The Name parameter is not a legal file name.

SIO_ERR_IO_FAILED

A  physical  I/O  error  caused  the  unlink  to  fail. 


8.5  sio_test

Purpose

Use mode and control operations to determine attributes of a file by
name, without opening the file.

Syntax

#include <sio_fs.h>

sio_return_t sio_test(const char *Name, sio_mode_t Mode,
                      sio_control_t *ControlOps,
                      sio_count_t ControlOpCount);

Parameters


Name  The  name  of  the  target  file. 

Mode  The access mode to be tested. May include one or
more of SIO_MODE_READ, SIO_MODE_WRITE, and
SIO_MODE_CREATE, ORed together.

ControlOps  An  array  of  control  operations  to  be  performed  on  the  file. 

ControlOpCount  The  number  of  operations  in  the  array  specified  by 
ControlOps. 


Description

This function allows an application to test for the existence of a file or
test whether a file can be created, and get the attributes of the file,
without opening or creating the file.

This function is similar to sio_open(), except for two differences:

•  It does not actually open or create the specified file.

•  It is not allowed to perform any control operations that change
   the permanent state of the file system.

This function may only use controls that do not change the
permanent state of the file system. Of the controls defined
in this document, only the following may be performed
by sio_test(): SIO_CTL_GetSize, SIO_CTL_GetAllocation,
SIO_CTL_GetPreallocation, SIO_CTL_GetLayout,
SIO_CTL_GetLabel, and SIO_CTL_GetConsistencyUnit. Controls that
change file state will return SIO_ERR_CONTROL_NOT_ON_TEST. If
implementation-specific controls are defined, the implementation must
specify whether or not each additional control modifies file state.

Provided a disallowed control is not specified, this function succeeds if
a call to sio_open() with the same parameters would have succeeded.

If this function fails for any reason, then the result codes of the
individual Ops are set in the same manner that sio_open() sets the result
codes of its Ops.


Return Codes


SIO_SUCCESS 

The  test  succeeded. 

SIO_ERR_ALREADY_EXISTS 

SIO_MODE_CREATE  was  specified  and  the  file  already  exists. 

SIO_ERR_CONTROL_FAILED 

At  least  one  of  the  mandatory  control  operations  would  have 
failed. 

SIO_ERR_CONTROL_NOT_ON_TEST 

At  least  one  of  the  control  operations  changes  the  file  state  and 
may  not  be  used  with  sio_test(). 

SIO_ERR_CONTROLS_CLASH 

Some of the mandatory control operations specified are incompatible
with each other and cannot be performed together by this
implementation.

SIO_ERR_FILE_NOT_FOUND

The file did not exist and SIO_MODE_CREATE was not specified.

SIO_ERR_INVALID_FILENAME

The Name parameter is not a legal file name.

SIO_ERR_IO_FAILED

A physical I/O error caused the function to fail.

SIO_ERR_MAX_OPEN_EXCEEDED

Opening the file would result in the task having more than
SIO_MAX_OPEN open file descriptors.

8.6  sio_rename

Purpose 

Rename  a  file. 

Syntax

#include <sio_fs.h>

sio_return_t sio_rename(const char *OldName,
                        const char *NewName);


Parameters 

OldName  The  current  name  of  the  file. 

NewName  The  new  name  of  the  file. 

Description 

This  function  changes  the  name  of  the  file  OldName  to  NewName. 
The  semantics  of  renaming  an  open  file  are  implementation-specific; 
possibilities  include  (but  are  not  limited  to)  allowing  tasks  which  have 
this  file  open  to  continue  to  use  their  open  file  descriptors,  allowing 
subsequent  I/O  operations  on  the  file  to  fail,  and  allowing  the  rename 
itself  to  fail  if  the  file  is  open. 

Return  Codes 

SIO_SUCCESS 

The  rename  succeeded. 

SIO_ERR_ALREADY_EXISTS

NewName already exists.

SIO_ERR_FILE_NOT_FOUND

OldName  did  not  exist. 

SIO_ERR_FILE_OPEN 

The  file  OldName  is  open  and  the  implementation  does  not  allow 
open  files  to  be  renamed. 

SIO_ERR_INVALID_FILENAME

One of the file names is not a valid name for a file.

SIO_ERR_IO_FAILED

A physical I/O error caused the function to fail.

9  Synchronous File I/O

This section introduces new functions for file read and write operations.
These provide file system functions previously unavailable in UNIX systems,
as they allow strided scatter and gather of data in memory and also in a file.
One of the primary performance-limiting problems for file systems and
parallel programs arises when the data-moving interfaces are restricted to
moving single contiguous regions of bytes. This restriction causes
applications to ask too frequently for small amounts of work and it denies
the system the ability to obtain performance benefits from grouping
(batching, scheduling, coalescing). Our first step toward removing this
limitation is to offer interfaces that allow the transfer of multiple ranges
in a file to or from multiple ranges in memory. We call this capability
scatter-gather.

The read and write operations introduced in this section are not like
traditional read/write operations. Rather than describing file and memory
addresses as linear buffers, these calls describe them as lists of strided
accesses. Each element of the list specifies a single strided access,
consisting of a starting address (offset), size of each contiguous region,
stride between the contiguous regions, and the total number of regions in
the strided access (see Section 4 for the formats of these elements). Data
are copied from the source buffer to the destination in canonical order. The
canonical order of an individual strided access is the sequence of contiguous
byte regions specified by the access. The canonical order for a list of
strided accesses is simply the concatenation of the canonical orders for the
strided accesses. Intuitively, all byte regions specified by the canonical
ordering in a file are concatenated into a contiguous zero-address based
virtual window. The byte regions specified in memory are also concatenated
in canonical order into this virtual window. Each byte of the virtual window
corresponds to one byte of the file and also to one byte of memory. The
number of bytes specified in the two lists must be equal.

We place no restrictions on the values of addresses occurring in the canonical
ordering of the data structure from the file or memory. This mapping may
be increasing, decreasing or non-monotonic in the file or memory, and may
cover a given byte more than once.

Note that the file system need not access the file or memory in canonical
order. Data can be accessed in the file or memory in any sequence as preferred
by the file system to optimize performance. The canonical sequence of file
regions is used only to compute the association of the file data with memory
regions.
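
The virtual-window pairing described above can be illustrated with a short sketch that maps a byte index in the window to its file offset or memory location. The element layout used here (start, size, stride, count) is an assumption made for illustration; the real formats are those defined in Section 4. The sketch also assumes that the stride is measured from the start of one region to the start of the next and that sizes and counts are non-negative. The names strided_access_t, list_bytes, canonical_locate and window_copy are hypothetical and not part of this API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element layout; the real formats are defined in Section 4. */
typedef struct {
    long start;    /* starting file offset or memory index */
    long size;     /* bytes in each contiguous region */
    long stride;   /* assumed: distance between starts of consecutive regions */
    long count;    /* number of contiguous regions */
} strided_access_t;

/* Total number of bytes described by a list of strided accesses. */
long list_bytes(const strided_access_t *list, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += list[i].size * list[i].count;
    return total;
}

/* Map byte `index` of the virtual window to its location in the file or
 * in memory. Returns 0 and stores the location in *where, or returns -1
 * if index lies outside the window. */
int canonical_locate(const strided_access_t *list, size_t n,
                     long index, long *where) {
    if (index < 0) return -1;
    for (size_t i = 0; i < n; i++) {
        long bytes = list[i].size * list[i].count;
        if (index < bytes) {
            long region = index / list[i].size;   /* which contiguous region */
            long within = index % list[i].size;   /* byte inside that region */
            *where = list[i].start + region * list[i].stride + within;
            return 0;
        }
        index -= bytes;   /* skip this access and continue with the next */
    }
    return -1;
}

/* Copy from a "file" array into a "memory" array by pairing equal indices
 * of the two canonical orderings. Returns the bytes copied, or -1 if the
 * lists describe different byte counts. */
long window_copy(const unsigned char *file,
                 const strided_access_t *fl, size_t fn,
                 unsigned char *mem,
                 const strided_access_t *ml, size_t mn) {
    assert(file != NULL && mem != NULL);
    long total = list_bytes(fl, fn);
    if (total != list_bytes(ml, mn)) return -1;
    for (long i = 0; i < total; i++) {
        long src, dst;
        if (canonical_locate(fl, fn, i, &src) != 0 ||
            canonical_locate(ml, mn, i, &dst) != 0)
            return -1;
        mem[dst] = file[src];
    }
    return total;
}
```

The copy loop visits the window in canonical order only for simplicity. As noted above, a file system is free to move the same bytes in any order, since the canonical order serves only to decide which file byte is paired with which memory byte.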

If the source list (i.e. the memory buffer during a write or the file buffer
during a read) contains the same region more than once then its data will
be copied into the destination buffer multiple times. If the destination list
contains the same region more than once then the resulting contents of the
duplicated region are undefined.*

Applications must not access an I/O operation’s memory buffer while the
operation is in progress. For example, a thread in a multi-threaded
application must not read or write a buffer while another thread has an I/O
in progress using the same buffer. Failure to avoid such accesses may corrupt
the task and/or file in undefined ways, including leaving the contents of the
file corrupted or causing the task to fault. Applications that wish to share
I/O buffers between threads must explicitly synchronize the threads’ accesses
to those buffers.

It is expected that many users of this API will desire simpler interfaces to
this functionality. In addition to the basic POSIX interfaces, the interfaces
in Appendix B are easily built on the interfaces provided in this API. These,
or similar simplified interfaces, could easily be provided by a high-level
library, and are not defined by this API.

* No function to check for duplicate regions in the destination list is
provided. However, such a function could be implemented as part of a
higher-level library built on top of this API.
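
As an illustration of the kind of higher-level helper mentioned in the footnote, the following sketch reports whether a list of strided accesses covers any byte more than once. It is not part of this API. The element layout (start, size, stride, count) and the name list_has_duplicates are assumptions made for this example, and the stride is assumed to be measured from the start of one region to the start of the next. The technique is to expand the list into half-open byte intervals, sort them by starting position, and compare each interval with its predecessor.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical element layout; the real formats are defined in Section 4. */
typedef struct { long start, size, stride, count; } strided_access_t;

typedef struct { long lo, hi; } interval_t;   /* half-open byte range [lo, hi) */

/* qsort comparator: order intervals by their starting position. */
static int by_lo(const void *a, const void *b) {
    const interval_t *x = a, *y = b;
    return (x->lo > y->lo) - (x->lo < y->lo);
}

/* Returns 1 if some byte is covered more than once, 0 if every byte is
 * covered at most once, and -1 if memory could not be allocated. */
int list_has_duplicates(const strided_access_t *list, size_t n) {
    assert(list != NULL || n == 0);

    /* Count the non-empty contiguous regions in the whole list. */
    size_t regions = 0;
    for (size_t i = 0; i < n; i++)
        if (list[i].size > 0 && list[i].count > 0)
            regions += (size_t)list[i].count;
    if (regions < 2) return 0;

    interval_t *iv = malloc(regions * sizeof *iv);
    if (iv == NULL) return -1;

    /* Expand every strided access into its contiguous regions. */
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (list[i].size <= 0) continue;
        for (long r = 0; r < list[i].count; r++) {
            iv[k].lo = list[i].start + r * list[i].stride;
            iv[k].hi = iv[k].lo + list[i].size;
            k++;
        }
    }

    /* After sorting by start, any overlap shows up between neighbours. */
    qsort(iv, regions, sizeof *iv, by_lo);
    int dup = 0;
    for (size_t i = 1; i < regions && !dup; i++)
        if (iv[i].lo < iv[i - 1].hi) dup = 1;

    free(iv);
    return dup;
}
```

A library could call such a check on the destination list (the memory list for a read, the file list for a write) before issuing the transfer, at a cost of one sort over the expanded regions.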


9.1  sio_sg_read, sio_sg_write

Purpose

Transfer data between a file and memory.

Syntax

#include <sio_fs.h>

sio_return_t sio_sg_read(int FileDescriptor,
                         const sio_file_io_list_t *FileList,
                         sio_count_t FileListLength,
                         const sio_mem_io_list_t *MemoryList,
                         sio_count_t MemoryListLength,
                         sio_transfer_len_t *TotalTransferred);

sio_return_t sio_sg_write(int FileDescriptor,
                          const sio_file_io_list_t *FileList,
                          sio_count_t FileListLength,
                          const sio_mem_io_list_t *MemoryList,
                          sio_count_t MemoryListLength,
                          sio_transfer_len_t *TotalTransferred);

Parameters


FileDescriptor  The  file  descriptor  of  an  open  file. 

FileList  Specification  of  file  data  to  be  read  or  written. 

FileListLength  Number  of  elements  in  FileList. 

MemoryList  Specification  of  the  memory  buffer  containing  data  to  be 
read  or  written. 

MemoryListLength  Number of elements in MemoryList.

TotalTransferred  Used to return the total number of bytes read or
written.


Description

These functions move data between a list of file locations and a list
of memory locations. All I/O must be done to a single file, identified by
the FileDescriptor argument.

The mapping between the collection of file regions specified by FileList
and the collection of memory byte regions specified by MemoryList
is by matching indices in the canonical ordering of the corresponding
sio_file_io_list_t and sio_mem_io_list_t.

If the total transfer cannot be completed because a file address is not
valid (i.e. reading beyond the end of the file), these interfaces will
complete successfully, and return in TotalTransferred the index of the
first byte in the canonical ordering that could not be transferred
(following the UNIX example); bytes preceding this index in the canonical
ordering have been transferred successfully and bytes following (and
including) it may or may not have been transferred successfully.

Implementations may return a value less than the actual amount
transferred if the operation was not successful; in particular, an
implementation may indicate that zero bytes were transferred successfully
on all failures.
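
The short-transfer rule can be made concrete with a toy model of sio_sg_read() that reads from an in-memory array instead of a real file. This is a sketch under stated assumptions, not how any implementation works: mock_sg_read, the element layout strided_access_t, the helper functions, and the numeric values of the return codes are all hypothetical, and sizes and counts are assumed to be non-negative. The model stops at the first byte of the canonical ordering that lies beyond the end of the "file", reports that index in TotalTransferred, and still returns SIO_SUCCESS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for <sio_fs.h>; values and layouts are assumed. */
typedef int sio_return_t;
typedef long sio_transfer_len_t;
enum { SIO_SUCCESS = 0, SIO_ERR_INVALID_FILE_LIST, SIO_ERR_UNEQUAL_LISTS };
typedef struct { long start, size, stride, count; } strided_access_t;

/* Total number of bytes described by a list. */
static long list_bytes(const strided_access_t *list, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += list[i].size * list[i].count;
    return total;
}

/* Location of virtual-window byte `index`.
 * Requires 0 <= index < list_bytes(list, n). */
static long locate(const strided_access_t *list, size_t n, long index) {
    for (size_t i = 0; i < n; i++) {
        long bytes = list[i].size * list[i].count;
        if (index < bytes)
            return list[i].start + (index / list[i].size) * list[i].stride
                   + index % list[i].size;
        index -= bytes;
    }
    return -1;   /* not reached when the precondition holds */
}

/* Toy model of sio_sg_read() against an in-memory "file" of file_len bytes. */
sio_return_t mock_sg_read(const unsigned char *file, long file_len,
                          const strided_access_t *fl, size_t fn,
                          unsigned char *mem,
                          const strided_access_t *ml, size_t mn,
                          sio_transfer_len_t *TotalTransferred) {
    assert(TotalTransferred != NULL);
    long total = list_bytes(fl, fn);
    *TotalTransferred = 0;          /* failures report zero bytes */
    if (total != list_bytes(ml, mn)) return SIO_ERR_UNEQUAL_LISTS;
    for (long i = 0; i < total; i++) {
        long src = locate(fl, fn, i);
        if (src < 0) return SIO_ERR_INVALID_FILE_LIST;
        if (src >= file_len) {      /* first byte past the end of the file */
            *TotalTransferred = i;  /* index of the first byte not moved */
            return SIO_SUCCESS;
        }
        mem[locate(ml, mn, i)] = file[src];
    }
    *TotalTransferred = total;
    return SIO_SUCCESS;
}
```

A caller therefore cannot rely on the return code alone to detect a short read; it compares TotalTransferred with the number of bytes it requested. The error paths in the sketch take the option, permitted above, of reporting zero bytes on failure.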

Return  Codes 

SIO_SUCCESS 

The  function  succeeded. 

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the I/O.

SIO_ERR_INVALID_DESCRIPTOR

FileDescriptor does not refer to a valid file descriptor.

SIO_ERR_INVALID_FILE_LIST

The file regions described by FileList are invalid, e.g. they contain
illegal addresses.

SIO_ERR_INVALID_MEMORY_LIST

The memory regions described by MemoryList are invalid, e.g.
they contain illegal addresses.

SIO_ERR_IO_FAILED

A physical I/O error caused the function to fail.

SIO_ERR_NO_SPACE

The file system ran out of space while trying to extend the file.

SIO_ERR_UNEQUAL_LISTS

The numbers of bytes described by MemoryList and FileList are not equal.


10  Asynchronous File I/O

Asynchronous I/O allows a single-threaded task to issue concurrent
I/O requests. The parallel file system supports up to
SIO_MAX_ASYNC_OUTSTANDING (see Section 5.1) asynchronous
I/Os at a time for each task. Asynchronous I/O functions merely initiate an
I/O, returning to the task a handle that may be used by the task to wait for
the I/O to complete, to check the status of the I/O, or to cancel the I/O.
These handles are of type sio_async_handle_t, which is an opaque type
defined by the system. Only the task that issued the asynchronous I/O is
able to use the sio_async_handle_t associated with the I/O to retrieve the
status of or cancel the I/O. Other tasks that wish to retrieve the status of or
cancel an I/O must contact the task that initiated the I/O.

10.1  sio_async_sg_read, sio_async_sg_write


Purpose

Asynchronously transfer data between a file and memory.

Syntax

#include <sio_fs.h>

sio_return_t sio_async_sg_read(int FileDescriptor,
                               const sio_file_io_list_t *FileList,
                               sio_count_t FileListLength,
                               const sio_mem_io_list_t *MemoryList,
                               sio_count_t MemoryListLength,
                               sio_async_handle_t *Handle);

sio_return_t sio_async_sg_write(int FileDescriptor,
                                const sio_file_io_list_t *FileList,
                                sio_count_t FileListLength,
                                const sio_mem_io_list_t *MemoryList,
                                sio_count_t MemoryListLength,
                                sio_async_handle_t *Handle);

Parameters


FileDescriptor  The file descriptor of an open file.

FileList  Specification of file data to be read or written.

FileListLength  Number of elements in FileList.

MemoryList  Specification of the memory buffer containing data to be
read or written.

MemoryListLength  Number of elements in MemoryList.

Handle  Handle returned by the operation, which can be used later to
determine the status of the I/O, to wait for its completion, or to
cancel it.

Description


These  functions  behave  similarly  to  sio_sg_read()  and  sio_sg_write(). 
A  successful  return,  however,  indicates  only  that  the  I/O  has  been 
queued  for  processing  by  the  parallel  file  system. 

Handle is a task-specific value which may be used to poll for completion,
block until the I/O completes, or cancel the I/O. The handle remains
valid until either the task completes, or sio_async_status_any()
indicates that the I/O transfer associated with Handle is no longer
in progress. While a handle is valid it counts towards the
SIO_MAX_ASYNC_OUTSTANDING asynchronous I/Os that a
task may have.

As  in  synchronous  I/O,  applications  must  neither  access  nor  modify  the 
contents  of  a  memory  buffer  while  an  asynchronous  I/O  is  in  progress 
on  that  buffer.  Doing  so  may  leave  the  buffer  and/or  the  file  in  an 
undefined  state,  and  may  cause  the  task  to  fault.  See  Section  9  for 
details. 


Return Codes


SIO_SUCCESS 

The  function  succeeded. 

SIO_ERR_INCORRECT_MODE 

The  mode  of  the  file  descriptor  does  not  allow  the  I/O. 

SIO_ERR_INVALID_DESCRIPTOR 

FileDescriptor  does  not  refer  to  a  valid  file  descriptor. 

SIO_ERR_INVALID_FILE_LIST

The file regions described by FileList are invalid, e.g. they contain
illegal addresses. Implementations may defer returning this error
until sio_async_status_any() is invoked on the I/O.

SIO_ERR_INVALID_MEMORY_LIST

The memory regions described by MemoryList are invalid, e.g.
they contain illegal addresses. Implementations may defer returning
this error until sio_async_status_any() is invoked on the I/O.

SIO_ERR_IO_FAILED

A physical I/O error caused the function to fail.

SIO_ERR_MAX_ASYNC_OUTSTANDING_EXCEEDED

The I/O request could not be initiated because doing so would
cause the calling task’s number of outstanding asynchronous I/Os
to exceed the limit.

SIO_ERR_NO_SPACE

The file system ran out of space while trying to extend the
file. Implementations may defer returning this error until
sio_async_status_any() is invoked on the I/O.

SIO_ERR_UNEQUAL_LISTS

The numbers of bytes described by MemoryList and FileList are not
equal. Implementations may defer returning this error until
sio_async_status_any() is invoked on the I/O.


10.2  sio_async_status_any

Purpose

Get the status of asynchronous I/Os.

Syntax

#include <sio_fs.h>

sio_return_t sio_async_status_any(
        sio_async_handle_t *HandleList,
        sio_count_t HandleListLength,
        sio_count_t *Index,
        sio_async_status_t *Status,
        sio_async_flags_t Flags);

Parameters


HandleList  An array of sio_async_handle_ts identifying the
asynchronous I/Os for which status is desired.

HandleListLength  The number of elements in HandleList.

Index  Used to return the index of the handle within HandleList for
which status is returned.

Status  Pointer to an sio_async_status_t to be filled in.

Flags  Determines whether the operation blocks or returns immediately.

Description

This function retrieves the status of one of the asynchronous I/Os
specified by HandleList. The index of the handle within HandleList for
which the status is returned is stored in Index. The system may return
the status for any of the handles, provided that if any of the I/Os are
complete or canceled, then the status for one of these I/Os is returned
and not the status of an I/O that is still in progress.

It is important to note that once the status for an I/O indicates
that the I/O is no longer in progress (i.e. it completed
or was canceled) the handle for the I/O is no longer valid. If
it is subsequently passed to sio_async_status_any() the value
SIO_ERR_INVALID_HANDLE will be returned if the handle is
still invalid, otherwise the status of the new asynchronous I/O will be
returned if the handle has been reused.

The  task  may  place  a  dummy  handle  in  the  HandleList  by  setting  the 
entry  to  SIO_ASYNC_DUMMY_HANDLE.  The  system  ignores  a 
handle  with  this  value,  allowing  the  task  to  retrieve  the  status  for  a  set 
of  handles  using  the  same  HandleList  array,  by  replacing  the  handle  for 
the  I/O  just  finished  with  the  dummy  value. 

If the Flags parameter includes SIO_ASYNC_BLOCKING, this
function will not return until at least one of the I/Os has completed. If
it includes SIO_ASYNC_NONBLOCKING, this function returns
immediately, regardless of whether or not one of the I/Os has completed.

Note  to  implementors:  When  an  I/O  is  canceled  the  count  field  in  Status 
will  contain  the  number  of  bytes  guaranteed  to  have  been  transferred 
prior  to  the  cancellation.  Implementations  may  always  set  this  value 
to  zero,  indicating  that  none  of  the  bytes  are  guaranteed  to  have  been 
transferred. 


Status Results

The following values are returned in the result field of the Status
structure, indicating the status of the I/O:

SIO_SUCCESS

The I/O has completed or been canceled. The count field contains
the number of bytes transferred.

SIO_ERR_INVALID_FILE_LIST

The file regions described by the FileList parameter passed to the
function that initiated the I/O are invalid, e.g. they contain illegal
addresses.

SIO_ERR_INVALID_MEMORY_LIST

The memory regions described by the MemoryList parameter
passed to the function that initiated the I/O are invalid, e.g. they
contain illegal addresses.


SIO_ERR_IO_CANCELED 

The I/O was canceled without completing. The count field contains
the number of bytes guaranteed to have been transferred
successfully prior to the cancellation. Implementations may set
count to zero.

SIO_ERR_IO_FAILED

A  physical  I/O  error  caused  the  function  to  fail. 

SIO_ERR_IO_IN_PROGRESS 

The  I/O  is  still  in  progress. 

SIO_ERR_MIXED_COLL_AND_ASYNC 

The  implementation  does  not  support  mixing  of  asynchronous  and 
collective  I/O  handles,  and  a  mix  of  handle  types  was  supplied. 

SIO_ERR_NO_SPACE 

The  file  system  ran  out  of  space  while  trying  to  extend  the  file. 

SIO_ERR_UNEQUAL_LISTS 

The size of the memory buffer does not match the size of the file
regions to be accessed.

Return  Values 

SIO_SUCCESS 

An  I/O  has  completed  or  been  canceled,  the  index  and  result  of 
which  are  stored  in  Index  and  Status,  respectively. 

SIO_ERR_INVALID_HANDLE 

At  least  one  of  the  elements  of  HandleList  is  neither  a  valid  handle 
for  an  asynchronous  I/O  nor  a  dummy  handle.  Index  will  contain 
the  index  of  one  of  the  invalid  handles. 

SIO_ERR_IO_IN_PROGRESS

All  I/Os  are  still  in  progress. 
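
The dummy-handle technique described above can be sketched as a loop that waits for a whole set of asynchronous I/Os while reusing a single HandleList array. Since no implementation is assumed to be available, the example carries its own toy sio_async_status_any(): the type layouts, constant values, handle representation, and the mock_start_io and mock_finish_io helpers are hypothetical, and the stub "blocks" by simply treating the first pending I/O as finished. Only the wait_all() function shows the client-side pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for <sio_fs.h>; values and layouts are assumed. */
typedef int sio_return_t;
typedef unsigned sio_count_t;
typedef int sio_async_handle_t;
typedef unsigned sio_async_flags_t;
typedef struct { sio_return_t result; long count; } sio_async_status_t;
enum { SIO_SUCCESS = 0, SIO_ERR_INVALID_HANDLE, SIO_ERR_IO_IN_PROGRESS };
enum { SIO_ASYNC_BLOCKING = 1, SIO_ASYNC_NONBLOCKING = 2 };
#define SIO_ASYNC_DUMMY_HANDLE (-1)

/* Toy I/O table: state 0 = invalid handle, 1 = in progress, 2 = finished. */
enum { MOCK_SLOTS = 4 };
static int mock_state[MOCK_SLOTS];
static long mock_bytes[MOCK_SLOTS];

sio_async_handle_t mock_start_io(long bytes) {
    for (int h = 0; h < MOCK_SLOTS; h++)
        if (mock_state[h] == 0) {
            mock_state[h] = 1;
            mock_bytes[h] = bytes;
            return h;
        }
    return SIO_ASYNC_DUMMY_HANDLE;   /* table full */
}

void mock_finish_io(sio_async_handle_t h) {
    assert(h >= 0 && h < MOCK_SLOTS);
    mock_state[h] = 2;
}

/* Stub: prefers a finished I/O; when blocking, pretends the first pending
 * I/O has just finished instead of really waiting. */
sio_return_t sio_async_status_any(sio_async_handle_t *HandleList,
                                  sio_count_t HandleListLength,
                                  sio_count_t *Index,
                                  sio_async_status_t *Status,
                                  sio_async_flags_t Flags) {
    int chosen = -1, pending = -1;
    for (sio_count_t i = 0; i < HandleListLength; i++) {
        sio_async_handle_t h = HandleList[i];
        if (h == SIO_ASYNC_DUMMY_HANDLE) continue;      /* ignored entry */
        if (h < 0 || h >= MOCK_SLOTS || mock_state[h] == 0) {
            *Index = i;
            return SIO_ERR_INVALID_HANDLE;
        }
        if (mock_state[h] == 2 && chosen < 0) chosen = (int)i;
        if (mock_state[h] == 1 && pending < 0) pending = (int)i;
    }
    if (chosen < 0 && pending >= 0 && (Flags & SIO_ASYNC_BLOCKING))
        chosen = pending;
    if (chosen < 0) return SIO_ERR_IO_IN_PROGRESS;
    Status->result = SIO_SUCCESS;
    Status->count = mock_bytes[HandleList[chosen]];
    mock_state[HandleList[chosen]] = 0;   /* handle is invalid once reported */
    *Index = (sio_count_t)chosen;
    return SIO_SUCCESS;
}

/* Client pattern: wait for every I/O in handles[0..n), replacing each
 * finished handle with the dummy value. Returns the total bytes
 * transferred, or -1 on the first error. */
long wait_all(sio_async_handle_t *handles, sio_count_t n) {
    long total = 0;
    sio_count_t remaining = 0;
    for (sio_count_t i = 0; i < n; i++)
        if (handles[i] != SIO_ASYNC_DUMMY_HANDLE) remaining++;
    while (remaining > 0) {
        sio_count_t idx;
        sio_async_status_t st;
        sio_return_t rc = sio_async_status_any(handles, n, &idx, &st,
                                               SIO_ASYNC_BLOCKING);
        if (rc != SIO_SUCCESS || st.result != SIO_SUCCESS) return -1;
        total += st.count;
        handles[idx] = SIO_ASYNC_DUMMY_HANDLE;   /* never pass it again */
        remaining--;
    }
    return total;
}
```

Overwriting each reported handle matters because a handle stops being valid as soon as its final status has been returned; passing it again would yield SIO_ERR_INVALID_HANDLE or, if the system has reused it, the status of an unrelated I/O.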

10.3  sio_async_cancel_all

Purpose 

Request  that  a  collection  of  asynchronous  I/Os  be  canceled. 

Syntax 

#include <sio_fs.h>

sio_return_t sio_async_cancel_all(
        sio_async_handle_t *HandleList,
        sio_count_t HandleListLength);

Parameters 

HandleList  An array of sio_async_handle_ts identifying the
asynchronous I/Os to be canceled.

HandleListLength  The number of elements in HandleList.

Description 

This function is used to request that asynchronous I/Os be canceled.
Cancellation is not guaranteed: an I/O may still complete in full or in
part, and an implementation may ignore cancel requests. A canceled read
leaves the contents of the I/O’s memory buffer undefined. Likewise, if a
write is canceled, the contents of the file regions being written
are undefined.

The  status  of  a  canceled  request  remains  available  until  an 
sio_async_status_any()  reports  its  completion.  An  application 
should  test  for  this  status  or  its  maximum  outstanding  asynchronous 
I/Os  will  appear  to  diminish. 

Note  to  implementors: 

An implementation may ignore cancellation requests altogether. In this
case a call to sio_async_status_any() on an I/O whose cancellation
was requested should return the normal, uncanceled completion
status of the I/O.

Note to implementors: Implementations are encouraged to avoid
reusing the same handles for different asynchronous I/Os within the
same task. A handle becomes invalid once the I/O is no longer in
progress and its status has been retrieved, but bugs may cause a task
to use such an invalid handle. If the system has reassigned the handle
to a new I/O the task will end up affecting the new I/O, instead of
getting an invalid handle error. Although this behavior is caused by
a bug in the application, avoiding reuse of handles will help track the
problem.

Return Values


SIO_SUCCESS 

The  request  for  cancellation  was  accepted.  This  does  not  mean 
that  the  I/Os  were  actually  canceled. 

SIO_ERR_INVALID_HANDLE

One  of  the  elements  in  HandleList  is  not  a  valid  handle  for  an 
asynchronous  I/O. 


11  File Access Pattern Hints

File access pattern hints provide a useful mechanism for users and libraries
to disclose the intended use of file regions to the file system. The hints,
if properly given, allow file systems to implement significant performance
optimizations. Many parallel scientific programs, for example, have access
patterns that are anathema to some file system architectures. These
applications could benefit if the file system accepted access hints that
protected the application from the performance consequences of the default
file system behavior. For example, access hints can be used by the file
system to choose caching and pre-fetching policies.

Hints are issued with the sio_hint() and sio_hint_by_name() interfaces
described in Section 11.3. These interfaces indicate a file, a hint class, and
a list of hints. Hints apply only to the future accesses of the task passing
in the hints; they are not associated with the accesses of other tasks. There
are two hint classes specified in this API: ordered and unordered. Vendors
are encouraged to extend this API with vendor-defined hint classes, which
must have names beginning with SIO_HINT_CLASS_VEND_. Within
any class of hints, the interaction of all hint types must be specified, but
the interaction of hint types from different classes need not be specified. In
particular, two calls issuing hints with different hint classes for the same
open file may not be meaningful to an implementation. However, since
hints are information rather than commands, the file system implementation
has broad freedom not to act where hint combinations are not meaningful.

The intent of hints is to allow the application to precisely specify what its
future access patterns will be. The hint interface does not provide specific
guarantees of how implementations will interpret these hints. Different
implementations are free to choose different strategies for responding to hints
(including ignoring them completely), but the application’s description of its
future accesses must conform to this interface.

System performance may be degraded due to inaccurate hints. Implementations
should attempt to protect against such performance degradation, but
are not required to. Similarly, applications should not assume that the file
system can always limit the performance impacts of inaccurate hints (accesses
that have been hinted, but will not actually be performed) and should
make use of the cancel options to minimize these effects.


11.1 Ordered Hints

In a set of ordered hints, each hint indicates a particular future access to
be issued by the calling task, and the sequence of issued hints indicates the
order of these future accesses. The total order of future accesses expressed
by multiple invocations of the hint interfaces is determined by logically
concatenating the hint array in each invocation onto the end of the hint array
built by previously issued hints. This allows access to different files to be
ordered. The accesses to different files predicted by one hint are expected to
occur after the accesses predicted by all hints preceding it in the total order,
and before the accesses predicted by all hints following it in the total order.

The flag field of each sio_hint_t in the class of ordered hints can contain the
following flags that can be ORed with each other:


SIO_HINT_READ or SIO_HINT_WRITE

SIO_HINT_READ indicates the hint describes a read access.
SIO_HINT_WRITE indicates the hint describes a write access.

Exactly one of these flags must be specified for each hint. When used
to cancel a hint, the flags in the cancel request must match the hint’s
flags.

SIO_HINT_CANCEL_ALL or SIO_HINT_CANCEL_NEXT

Regardless of the file specified by the hint interface call and the
regions specified by the io_list fields in the sio_hint_t structures,
SIO_HINT_CANCEL_ALL indicates that all previously issued hints
should be ignored.

SIO_HINT_CANCEL_NEXT indicates that the previously issued hint
matching the file and region specified with this
SIO_HINT_CANCEL_NEXT whose predicted access is next to occur
should be ignored. A hint is considered “outstanding” if the data
transfer request predicted by the hint has not yet occurred. It is
expected that the data transfer requests will take place in the sequence
given by the total ordered list of hints for the task, with the possibility
that not all transfer requests will have corresponding hints. The “next
outstanding hint” will be the first matching hint in the set of ordered hints
previously issued by this task for which no corresponding transfer
request has occurred.

A previously issued hint’s profile “matches” the current hint’s profile
if the hints pertain to the same file, the regions specified by
the io_list entry in the sio_hint_t structures are the same, and the
SIO_HINT_READ or SIO_HINT_WRITE flag matches.

No more than one of these flags may be specified for each hint.

Note to implementors: Implementations are not required to keep track
of “outstanding” hints. The concept of “outstanding” only describes
the application’s intent in issuing the hint, and does not describe the
implementation’s behavior. In implementations that do not keep track
of “outstanding” hints the SIO_HINT_CANCEL_NEXT hint may
not be useful.
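
The matching rule and the notion of the "next outstanding hint" can be illustrated with a small model. The sketch below is not part of the proposed API: the structure, the flag values, and every function name prefixed with model_ are hypothetical, and each hint is simplified to a single contiguous region of one file. It only shows one way an implementation that does track outstanding hints might locate the hint targeted by SIO_HINT_CANCEL_NEXT.

```c
#include <stddef.h>

/* Hypothetical flag values; the real ones are defined by <sio_fs.h>. */
#define MODEL_HINT_READ  0x1u
#define MODEL_HINT_WRITE 0x2u

/* Simplified stand-in for an ordered hint: one file, one contiguous region. */
struct model_hint {
    int      file;        /* identifies the file the hint refers to    */
    long     offset;      /* start of the hinted region                */
    long     size;        /* length of the hinted region               */
    unsigned rw;          /* MODEL_HINT_READ or MODEL_HINT_WRITE       */
    int      outstanding; /* nonzero until the predicted access occurs */
};

/* Two hints "match" if file, region, and read/write flag all agree. */
int model_hint_matches(const struct model_hint *a, const struct model_hint *b)
{
    return a->file == b->file && a->offset == b->offset &&
           a->size == b->size && a->rw == b->rw;
}

/*
 * Return the index of the first outstanding hint matching `probe`,
 * scanning in issue order, or -1 if there is none.
 */
int model_next_outstanding(const struct model_hint *hints, size_t n,
                           const struct model_hint *probe)
{
    for (size_t i = 0; i < n; i++) {
        if (hints[i].outstanding && model_hint_matches(&hints[i], probe))
            return (int)i;
    }
    return -1;
}

/* Model of a CANCEL_NEXT request: drop the next matching outstanding hint. */
int model_cancel_next(struct model_hint *hints, size_t n,
                      const struct model_hint *probe)
{
    int i = model_next_outstanding(hints, n, probe);
    if (i >= 0)
        hints[i].outstanding = 0;
    return i;
}
```

In this model a hint whose predicted access has already occurred is skipped, so a cancel request affects only the earliest matching hint that is still pending.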


11.2 Unordered Hints

In an unordered set of hints, each hint independently specifies information
about some set of future accesses. There is no explicit ordering among the
accesses predicted by unordered hints. These predictions remain in effect
until explicitly canceled.

The flag field of each sio_hint_t in the class of unordered hints can contain
the following flags:


SIO_HINT_READ and/or SIO_HINT_WRITE

SIO_HINT_READ indicates that the hint describes read accesses.
SIO_HINT_WRITE indicates that the hint describes write accesses.
If SIO_HINT_READ and SIO_HINT_WRITE are given together,
they indicate that the hint describes a read-write access.

At least one of these flags must be specified for each hint.

SIO_HINT_CANCEL_ALL or SIO_HINT_CANCEL_MATCHING

SIO_HINT_CANCEL_ALL suggests that the file system ignore
all previously issued unordered hints from this task, regardless
of the file and file regions given in any of these hints.
SIO_HINT_CANCEL_MATCHING suggests that the file system
ignore all previously issued unordered hints from this task which match
the given sio_hint_t.

No more than one of these flags may be specified for each hint.

SIO_HINT_SEQUENTIAL, SIO_HINT_REVERSE,
SIO_HINT_RANDOM_PARTIAL,
SIO_HINT_RANDOM_COMPLETE,
SIO_HINT_NO_FURTHER_USE, or SIO_HINT_WILL_USE

Each hint expresses an access pattern predicted for the file region given
by the hint. When changing a predicted access pattern on a region, a
SIO_HINT_CANCEL_MATCHING hint should be issued to cancel
the old hint before the new access hint is issued. The interpretation
of multiple predicted access patterns on the same region or partial
(overlapping) region is unspecified. These patterns are:

SIO_HINT_SEQUENTIAL

The entire region will be accessed in non-overlapping blocks whose
starting offsets increase monotonically.

SIO_HINT_REVERSE

The entire region will be accessed in non-overlapping blocks whose
starting offsets decrease monotonically.

SIO_HINT_RANDOM_COMPLETE

Accesses in the region will have starting addresses and sizes that
vary without pattern but the entire region will be accessed.

SIO_HINT_RANDOM_PARTIAL

Accesses in the region will have starting addresses and sizes that
vary without pattern and the entire region may not be accessed.

SIO_HINT_NO_FURTHER_USE

No further accesses are expected in the region.

SIO_HINT_WILL_USE

All data in the region will be accessed although no explicit pattern
can be predicted or excluded.⁷

Exactly one of these flags must be specified for each hint.

⁷This pattern should be used in cases where SIO_HINT_RANDOM_COMPLETE
cannot because the access pattern might not be random.
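
The three flag groups for unordered hints each carry their own cardinality rule: at least one of read/write, at most one cancel flag, and exactly one access pattern. A minimal sketch of a validity check follows. The bit assignments and all model_ names are assumptions made for illustration only; the actual flag values would come from <sio_fs.h>.

```c
/* Hypothetical bit assignments; the real values are defined by <sio_fs.h>. */
enum {
    MODEL_HINT_READ            = 0x001,
    MODEL_HINT_WRITE           = 0x002,
    MODEL_HINT_CANCEL_ALL      = 0x004,
    MODEL_HINT_CANCEL_MATCHING = 0x008,
    MODEL_HINT_SEQUENTIAL      = 0x010,
    MODEL_HINT_REVERSE         = 0x020,
    MODEL_HINT_RANDOM_PARTIAL  = 0x040,
    MODEL_HINT_RANDOM_COMPLETE = 0x080,
    MODEL_HINT_NO_FURTHER_USE  = 0x100,
    MODEL_HINT_WILL_USE        = 0x200
};

#define MODEL_RW_MASK      (MODEL_HINT_READ | MODEL_HINT_WRITE)
#define MODEL_CANCEL_MASK  (MODEL_HINT_CANCEL_ALL | MODEL_HINT_CANCEL_MATCHING)
#define MODEL_PATTERN_MASK (MODEL_HINT_SEQUENTIAL | MODEL_HINT_REVERSE | \
                            MODEL_HINT_RANDOM_PARTIAL | \
                            MODEL_HINT_RANDOM_COMPLETE | \
                            MODEL_HINT_NO_FURTHER_USE | MODEL_HINT_WILL_USE)

/* Count the bits set in v. */
int model_popcount(unsigned v)
{
    int n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Return 1 if `flags` obeys the three rules for unordered hints, else 0. */
int model_unordered_flags_valid(unsigned flags)
{
    if (model_popcount(flags & MODEL_RW_MASK) < 1)
        return 0; /* at least one of READ / WRITE */
    if (model_popcount(flags & MODEL_CANCEL_MASK) > 1)
        return 0; /* no more than one cancel flag */
    if (model_popcount(flags & MODEL_PATTERN_MASK) != 1)
        return 0; /* exactly one access pattern */
    return 1;
}
```

A client library layered above the interface could run a check of this kind before issuing hints, so that malformed combinations are caught on the client side.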


11.3 sio_hint, sio_hint_by_name

Purpose

Issue a set of predictions about the future accesses of this task.

Syntax

#include <sio_fs.h>

sio_return_t sio_hint(int FileDescriptor,
                      sio_hint_class_t HintClass,
                      const sio_hint_t *Hints,
                      sio_count_t HintCount);

sio_return_t sio_hint_by_name(const char *FileName,
                              sio_hint_class_t HintClass,
                              const sio_hint_t *Hints,
                              sio_count_t HintCount);


Parameters 

FileDescriptor  The  file  descriptor  of  an  open  file  to  which  these  hints 
apply. 

FileName  The  name  of  a  file,  not  necessarily  an  open  file,  to  which 
these  hints  apply. 

HintClass  The  class  of  the  hints  being  issued. 

Hints  An  array  of  file  access  pattern  hints. 

HintCount  The  number  of  entries  in  the  Hints  array. 

Description

These functions report the application’s knowledge of future access
patterns to the file system. The purpose of issuing this information is to
enable optimizations in the dynamic behavior of the parallel file system.
This knowledge is expressed as a set of hints, all from the same
hint class. The interpretation of mixtures of hint types from different
hint classes is unspecified. Hints can be applied to an open file
using sio_hint(), or to a named file (which may not be open) using
sio_hint_by_name(). Each sio_hint_t structure in the Hints array
describes a hint type applied to a list of file regions and optionally
hint-specific arguments.

If the size, stride, and element_cnt fields for a particular
sio_file_io_list_t in a hint are all zero, then the region being specified
begins at the offset given by the offset field of that sio_file_io_list_t
and continues until the end of the file. The entire contents of a file are
specified as the region whenever an sio_file_io_list_t contains zero in
the four fields: offset, size, stride and element_cnt.

The implementation is not required to act on any specific hint, or on
any hints at all.
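
The all-zero convention for region entries can be sketched as follows. The structure below is only a stand-in that borrows the four field names used in the text; the real sio_file_io_list_t and its field types are defined by <sio_fs.h>, and the function name and the use of long are assumptions. The strided case is deliberately left unmodeled.

```c
/*
 * Stand-in for the four sio_file_io_list_t fields named in the text.
 * The real type and its field types are defined by <sio_fs.h>.
 */
struct model_file_io_list {
    long offset;
    long size;
    long stride;
    long element_cnt;
};

/*
 * If size, stride, and element_cnt are all zero, the entry denotes the
 * region from `offset` to the end of the file: store that region in
 * *start and *length and return 1. Otherwise return 0 and leave the
 * outputs untouched (the strided case is not modeled here).
 */
int model_open_ended_region(const struct model_file_io_list *e,
                            long file_size, long *start, long *length)
{
    if (e->size != 0 || e->stride != 0 || e->element_cnt != 0)
        return 0;
    *start = e->offset;
    *length = (e->offset < file_size) ? file_size - e->offset : 0;
    return 1;
}
```

An entry with all four fields zero therefore resolves to start 0 and a length equal to the file size, which is the "entire contents" case described above.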


Return Codes

SIO_SUCCESS

The function succeeded.

SIO_ERR_FILE_NOT_FOUND

The specified file did not exist.

SIO_ERR_HINT_TYPES_CLASH

The class of this hint differs from the class of another hint
previously issued for the same file region.⁸

SIO_ERR_INVALID_CLASS

The hint class given in HintClass is not a valid hint class.

SIO_ERR_INVALID_DESCRIPTOR

FileDescriptor does not refer to a valid file descriptor created by
sio_open().

SIO_ERR_INVALID_FILENAME

The name given by FileName is invalid.

⁸As mentioned above, the effects of mixing hints of different classes for the same file
region are undefined. This error code is provided for implementations that attempt to
resolve hints from different classes.


12 Client Cache Control


The  basic  API  includes  facilities  to  control  caching  of  data  in  client  memory. 
The  caching  interfaces  are  specified  such  that  it  is  a  valid  implementation 
strategy  to  simply  ignore  all  cache  control  calls.  The  only  requirement  of 
an  implementation  that  ignores  these  calls  is  that  it  must  provide  strongly 
consistent  semantics. 

The  client  caching  mode  of  an  SIO  file  may  be  specified  by  including  the 
SIO_CTL_SetCachingMode  control  operation  when  making  sio_open() 
or  sio_control()  calls. 

This  API  specifies  client  caching  modes  with  the  type  sio_caching_mode_t, 
which  can  have  the  following  values: 


SIO_CACHING_NONE 

Completely  disable  client  caching. 

SIO_CACHING_STRONG 

Allow  strongly-consistent  client  caching.  The  file  system  may  choose 
to  provide  caching  with  strong  sequential  consistency,  or  provide  no 
caching  at  all. 

SIO_CACHING_WEAK 

Allow  weakly-consistent  client  caching.  The  file  system  may  provide  no 
client  caching,  strongly-consistent  client  caching,  or  weakly-consistent 
client  caching. 


Caching mode names beginning with SIO_CACHING_ are reserved for
future use by this API. Vendors may define their own caching modes by
naming them with the prefix SIO_CACHING_VEND_.

An SIO parallel file system implementation’s default client caching
mode must provide sequential consistency. That is, it must be either
SIO_CACHING_NONE, SIO_CACHING_STRONG, or a vendor-defined
mode that provides strong sequential consistency.

If client caching is not disabled by using a caching mode of
SIO_CACHING_NONE, the file system on a client node is free to
maintain local copies of file data for both read and write operations.

In a system with strongly-consistent caching, every write forces the client
node to immediately make the file system aware that the file has changed.
This also requires that client nodes either check the validity of cached data
before providing them to applications to satisfy a read, or be notified
whenever cached or potentially cached data have changed.

On the other hand, weakly-consistent client caching allows the file system to
avoid the messaging and bookkeeping which a sequentially consistent caching
mode mandates, while providing the application with the benefits of caching.
With this form of caching, client nodes may defer exposing all or part of a set
of changes to a file until instructed otherwise by the application. Likewise,
a client node need not confirm the validity of cached data with the server
unless explicitly instructed to do so by the application.

An application informs the file system that data written on a file descriptor
should become visible to other readers via the SIO_CTL_Propagate control
operation. If the changed data have not already been exposed to the rest
of the file system, this is done immediately. Note that all, none, or part
of the changed data may already have been exposed to the rest of the file
system.

Likewise, an application informs the file system that locally cached data may
be stale using the SIO_CTL_Refresh control operation. Reads of refreshed
regions of a file are guaranteed to yield either the most current available data,
or data that were not stale at the time of the most recent refresh operation.
That is to say, if the data returned by the read are stale, they were made so
after the refresh.

It is assumed that applications using weakly-consistent client caching either
do not share data between nodes, or provide their own internal synchronization
to coordinate when nodes must propagate and refresh data.

Thus, the way in which a node A would write data which are then read by
a node B is:

A writes data to region R
A propagates data in region R
(Implicit:) A and B synchronize; B becomes aware that new data in region
R are available
B refreshes data in region R
B reads data in region R
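
The sequence above can be traced with a toy simulation. This is a sketch under strong simplifying assumptions, not the proposed interface: each client caches the whole file as one unit, the "server" is a plain array, and all model_ names are hypothetical stand-ins for the effects of writes, reads, SIO_CTL_Propagate, and SIO_CTL_Refresh.

```c
#include <string.h>

#define MODEL_FILE_SIZE 16

/* The shared file as held by the file system servers. */
struct model_server {
    char data[MODEL_FILE_SIZE];
};

/* One client node's weakly-consistent cache of the whole file. */
struct model_client {
    struct model_server *server;
    char cache[MODEL_FILE_SIZE];
    int  valid; /* cache holds a copy fetched from the server */
    int  dirty; /* cache holds writes not yet exposed         */
};

void model_client_init(struct model_client *c, struct model_server *s)
{
    c->server = s;
    c->valid = 0;
    c->dirty = 0;
    memset(c->cache, 0, sizeof c->cache);
}

/* Fetch the server copy if the cache is not valid. */
static void model_fill(struct model_client *c)
{
    if (!c->valid) {
        memcpy(c->cache, c->server->data, MODEL_FILE_SIZE);
        c->valid = 1;
    }
}

/* A write goes to the local cache only. */
void model_write(struct model_client *c, int off, char byte)
{
    model_fill(c);
    c->cache[off] = byte;
    c->dirty = 1;
}

/* A read is served from the local cache, which may be stale. */
char model_read(struct model_client *c, int off)
{
    model_fill(c);
    return c->cache[off];
}

/* Analogue of SIO_CTL_Propagate: expose cached writes to the server. */
void model_propagate(struct model_client *c)
{
    if (c->dirty) {
        memcpy(c->server->data, c->cache, MODEL_FILE_SIZE);
        c->dirty = 0;
    }
}

/* Analogue of SIO_CTL_Refresh: discard possibly stale cached data.
 * Unpropagated local writes are kept in this toy model. */
void model_refresh(struct model_client *c)
{
    if (!c->dirty)
        c->valid = 0;
}
```

In this model, B continues to read the old byte after A propagates, and sees the new byte only once B has refreshed, which is why the synchronization step between A and B is needed.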

The granularity of caching is known as the consistency unit. This defines both
the size and the alignment of the blocks of data within the file for which the
file system ensures that all non-conflicting writes are merged into the file.
Tasks on different nodes cannot use weak consistency and achieve consistent
parallel updates within a single consistency unit. Any conflicting writes
within a single consistency unit will be resolved by an arbitrary selection
of a winning writer when the data arrive at a server. The size of the
consistency unit is implementation specific, and is represented by the constant
SIO_CACHE_CONSISTENCY_UNIT. Additionally, the control operation
SIO_CTL_GetConsistencyUnit can be used to retrieve the consistency
unit for a file descriptor.⁹ Applications should not make any assumptions
about the size of the consistency unit; it may vary between individual
bytes, cache lines, pages, and file blocks depending upon the implementation
of the file system.

⁹Currently, this should always yield SIO_CACHE_CONSISTENCY_UNIT. This
is intended to allow for future extensions, which may provide different consistency units
within the same implementation.

The motivation for providing weakly-consistent client caching as an option
within the parallel file system is to allow parallel applications that could
benefit from a decrease in the total amount of data being transferred between
clients and servers to exercise relatively fine-grained control over the
consistency of their local caches. SIO_CTL_Propagate and SIO_CTL_Refresh
operations can be piggy-backed onto synchronization steps that already
exist in parallel applications. These primitives give application programmers
and toolkit developers the mechanisms necessary to ensure consistency of the
local parallel file system cache, without requiring the parallel file system to
enforce any consistency model itself.

This implementation of weakly-consistent caching is only intended to cope
with sharing among the tasks of a parallel application. To avoid unintended
sharing among independent applications, traditional methods based on
detecting conflicts at open time and disabling caching or resorting to
strongly-consistent caching may be used.
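
Because conflicting writes within one consistency unit are resolved arbitrarily, an application that partitions a file among tasks may want to check whether two byte ranges touch a common unit. The sketch below assumes that units are aligned at multiples of the unit size counted from file offset zero, which is one reading of "size and alignment" above; the function names and the use of long are hypothetical. In a real program the unit size would be obtained from SIO_CACHE_CONSISTENCY_UNIT or SIO_CTL_GetConsistencyUnit rather than passed as a literal.

```c
/*
 * Index of the consistency unit containing byte `offset`.
 * `unit` is the unit size in bytes and must be greater than zero.
 */
long model_unit_index(long offset, long unit)
{
    return offset / unit;
}

/*
 * Return 1 if byte ranges [a_off, a_off + a_len) and [b_off, b_off + b_len)
 * touch at least one common consistency unit, else 0.
 * Both lengths must be greater than zero.
 */
int model_ranges_share_unit(long a_off, long a_len,
                            long b_off, long b_len, long unit)
{
    long a_first = model_unit_index(a_off, unit);
    long a_last  = model_unit_index(a_off + a_len - 1, unit);
    long b_first = model_unit_index(b_off, unit);
    long b_last  = model_unit_index(b_off + b_len - 1, unit);

    /* Two closed intervals of unit indices overlap. */
    return a_first <= b_last && b_first <= a_last;
}
```

Two ranges that do not overlap byte-for-byte can still share a unit, for example the two halves of a single 512-byte unit, and in that case weakly-consistent parallel updates to them are not safe.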

Some implementations may choose not to provide weak client cache consistency
by ignoring a SIO_CTL_SetCachingMode operation that specifies
the SIO_CACHING_WEAK mode, as well as the SIO_CTL_Propagate
and SIO_CTL_Refresh operations. In this case, the
SIO_CTL_GetCachingMode operation should return a value of
SIO_CACHING_NONE, SIO_CACHING_STRONG,
or a sequentially-consistent vendor-defined caching mode as appropriate, and
SIO_CTL_Propagate and SIO_CTL_Refresh should always return success.
(This way, an application which can tolerate weakly-consistent caching
will not see extraneous errors in its absence.)¹⁰

Note that client caching is controlled on a per-file descriptor basis, so it
is possible to have a file opened with one client caching mode on one file
descriptor and with a different mode on another file descriptor.

Descriptions of the SIO_CTL_GetCachingMode,
SIO_CTL_SetCachingMode, SIO_CTL_Propagate,
SIO_CTL_Refresh, and SIO_CTL_GetConsistencyUnit control operations
are given in Section 13.

Note to implementors: The routine sio_close() implicitly performs a
SIO_CTL_Propagate on the file descriptor. This causes all cached writes
to be exposed to the file system at the time the file is closed, if they have
not been already.

¹⁰Since weak caching mode can be implemented using strong caching, it is possible that
an application running on one node may see data modifications that have not yet been
propagated on a remote node. This is normal, since a weakly-consistent caching policy
may expose the results of writes soon after or immediately as they occur.


13 Control Operations

This section describes the file control operations that can be performed using
the functions sio_control(), sio_open(), and sio_test().

These control operations allow properties of files, file descriptors, and the file
system to be set and retrieved.

Control operations are performed by invoking sio_open(), sio_control(), or
sio_test() with the list of operations to be performed. Each operation
description, an sio_control_t, includes the code of the operation to be
performed, a pointer to the data to be manipulated by that operation, and
space for a result code. In the following sections, information is provided
about the various operation codes that must be implemented by file systems
that conform to this API.

Operation names beginning with SIO_CTL_ are reserved for use by this
API. Operation names beginning with SIO_CTL_VEND_ may be used by
vendors to define vendor-specific operations.


13.1 SIO_CTL_GetSize, SIO_CTL_SetSize


Purpose

Get or set the file’s logical size.

Affects

Open file.

Parameter Type

Pointer to a sio_offset_t.

Description

Applications may query and adjust the logical size (see Section 6.1)
of a file using these control operations. The SIO_CTL_SetSize
operation causes the logical size of the file to be set to the value in the
sio_offset_t pointed to by the op_data field of the sio_control_t.
Setting a file’s logical size may change the amount of storage that the file
uses, but is not guaranteed to do so. An application wishing to
preallocate storage for a file should use the SIO_CTL_SetPreallocation
control operation.

The SIO_CTL_GetSize operation causes the logical size of the file
being operated on to be placed in the sio_offset_t pointed to by the
op_data member of the sio_control_t.
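
The way a list of control operations carries its data through op_data, with a result slot per operation, can be sketched with a toy model. Nothing here is the real interface: the structure layout, the op_code and result field names, the numeric codes, and the model_run_controls function are all assumptions; only the op_data name is taken from the text, and the actual sio_control_t is defined by <sio_fs.h>.

```c
/* Hypothetical operation and result codes; real values live in <sio_fs.h>. */
enum { MODEL_CTL_GET_SIZE = 1, MODEL_CTL_SET_SIZE = 2 };
enum { MODEL_SUCCESS = 0, MODEL_ERR_OP_UNSUPPORTED = 1 };

/*
 * Stand-in for sio_control_t: operation code, data pointer, result slot.
 * Only the op_data name comes from the text; the others are guesses.
 */
struct model_control {
    int   op_code;
    void *op_data;
    int   result;
};

/* Stand-in for an open file; only the logical size is modeled. */
struct model_file {
    long logical_size;
};

/*
 * Apply each operation in order and record a per-operation result.
 * Returns the number of operations that succeeded.
 */
int model_run_controls(struct model_file *f, struct model_control *ops,
                       int count)
{
    int ok = 0;
    for (int i = 0; i < count; i++) {
        long *size = (long *)ops[i].op_data;
        switch (ops[i].op_code) {
        case MODEL_CTL_GET_SIZE:
            *size = f->logical_size;  /* result flows out via op_data */
            ops[i].result = MODEL_SUCCESS;
            break;
        case MODEL_CTL_SET_SIZE:
            f->logical_size = *size;  /* new size flows in via op_data */
            ops[i].result = MODEL_SUCCESS;
            break;
        default:
            ops[i].result = MODEL_ERR_OP_UNSUPPORTED;
            break;
        }
        if (ops[i].result == MODEL_SUCCESS)
            ok++;
    }
    return ok;
}
```

Submitting a set-size operation followed by a get-size operation in one list lets the caller confirm the new logical size in a single call in this model.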


Result Values

SIO_SUCCESS

The operation succeeded.

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the operation.

SIO_ERR_IO_FAILED

A physical I/O error caused the operation to fail.

SIO_ERR_NO_SPACE

The system needs to increase the amount of storage used by the
file but cannot.


13.2 SIO_CTL_GetAllocation

Purpose

Get the file’s physical size.

Affects

Underlying file.

Parameter Type

Pointer to a sio_offset_t.

Description

The SIO_CTL_GetAllocation operation causes the file’s physical size
(see Section 6.1) to be placed in the sio_offset_t pointed to by the
op_data field of the sio_control_t.

Result Values

SIO_SUCCESS

The operation succeeded.

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the operation.

SIO_ERR_IO_FAILED

A physical I/O error caused the operation to fail.


13.3 SIO_CTL_GetPreallocation,
SIO_CTL_SetPreallocation

Purpose

Get or set the amount of space preallocated for the file.

Affects

Underlying file.

Parameter Type

Pointer to a sio_offset_t.

Description

The SIO_CTL_GetPreallocation operation causes the amount of
space preallocated (see Section 6.1) for the file being operated on to
be placed in the sio_offset_t pointed to by the op_data field of the
sio_control_t.

The SIO_CTL_SetPreallocation operation causes the amount of
space preallocated for the file being operated on to be set to the value
in the sio_offset_t pointed to by the op_data field of the sio_control_t.
A preallocation applies to an open file and will be reset to zero when
the file is closed. While the file is open, writes by other tasks that extend
the physical size of the file may reduce the unconsumed preallocation.

If either the SIO_CTL_GetPreallocation operation or the
SIO_CTL_SetPreallocation operation is supported, both must be
supported.

Result Values

SIO_SUCCESS

The operation succeeded.

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the operation.

SIO_ERR_IO_FAILED

A physical I/O error caused the operation to fail.

SIO_ERR_NO_SPACE

There isn’t enough free space in the system to satisfy the request.

SIO_ERR_OP_UNSUPPORTED

The operation is not supported by the system.


13.4 SIO_CTL_GetCachingMode,
SIO_CTL_SetCachingMode


Purpose

Get or set the file’s caching mode.

Affects

File descriptor.

Parameter Type

Pointer to a sio_caching_mode_t.

Description

The SIO_CTL_GetCachingMode operation causes the caching
mode of the file descriptor to be placed in the sio_caching_mode_t
pointed to by the op_data field of the sio_control_t.

The SIO_CTL_SetCachingMode operation causes the caching mode
of the file descriptor to be set to the value of the sio_caching_mode_t
pointed to by the op_data field of the sio_control_t. SIO
implementations which provide support for multiple caching modes may elect not
to provide support for changing the caching mode of an open file.


Result Values

SIO_SUCCESS

The operation succeeded.

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the operation.

SIO_ERR_ONLY_AT_OPEN

The system does not allow the caching mode of an open file to
be changed. Caching modes can only be changed as part of
sio_open().

SIO_ERR_OP_UNSUPPORTED

The system does not support SIO_CTL_SetCachingMode.


13.5 SIO_CTL_Propagate


Purpose

Force locally cached writes to be made visible to other nodes.

Affects

Cached writes associated with file descriptor.

Parameter Type

Pointer to a sio_file_io_list_t.

Description

This operation allows a task to force the parallel file system to make
any data associated with a particular set of byte ranges visible to other
nodes in the system (see Section 12 for information about why this
might be necessary), as specified by the sio_file_io_list_t pointed to
by the op_data field of the control request. If op_data is NULL, the
propagation will apply to all bytes in the file. If the size, stride, and
element_cnt fields of the sio_file_io_list_t pointed to by the op_data
field are all zero, then the set of bytes to be propagated begins at the
offset specified in the offset field of the sio_file_io_list_t and continues
until the end of the file.

This operation only affects those bytes written via the given file
descriptor; if an application writes to a file using more than one file descriptor,
it must perform a propagate operation on each of them to guarantee
the dirty data become visible to other nodes. While it is guaranteed
after a propagate operation completes that all locally cached writes for
the specified file regions have been exposed to the rest of the file
system, it is not guaranteed that some or all of the changed data were not
visible to the rest of the file system prior to the propagate. That is,
weakly-consistent client caching implies only that cached writes will be
exposed to the rest of the file system no later than the completion of
the propagate operation.

Result Values

SIO_SUCCESS

The results of all writes on this file descriptor in the specified
region(s) have been exposed to the rest of the file system.

SIO_ERR_INVALID_FILE_LIST

op_data is neither NULL nor a pointer to a valid sio_file_io_list_t.


13.6 SIO_CTL_Refresh


1783  Purpose 

1784  Inform  the  file  system  that  locally  cached  data  may  be  invalid. 

1785  Affects 

1786  Blocks  in  client’s  cache  containing  data  for  this  file. 

1787  Parameter  Type 
Pointer  to  a  sio_fileJo_list_t. 

Description 

This operation informs the parallel file system that data cached for
a file may be stale, that is, superseded by more recent writes (see
Section 12 for information about why this might be necessary). Future
reads to the specified client region(s) are guaranteed to not yield data
that were stale at the time the refresh operation began. File region(s)
are specified by the sio_file_io_list_t pointed to by the op_data field
of the control request. If op_data is NULL, the operation will apply
to all bytes in the file. If the size, stride, and element_cnt fields of the
sio_file_io_list_t pointed to by the op_data field are all zero, then the
operation applies to the set of bytes beginning at the offset specified
in the offset field of the sio_file_io_list_t and ending at the end of the
file.
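
The three ways of naming a region (a NULL pointer, an all-zero entry, and a strided entry) can be sketched as a small membership test. This is only an illustration: region_t is a hypothetical stand-in for sio_file_io_list_t that models just the four fields named in the text, with assumed 64-bit types, and stride is assumed to be the distance between the starts of successive elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sio_file_io_list_t; only the four fields
 * named in the text are modeled, and their real types may differ. */
typedef struct {
    uint64_t offset;      /* first byte of the first element */
    uint64_t size;        /* bytes per element */
    uint64_t stride;      /* assumed: distance between element starts */
    uint64_t element_cnt; /* number of elements */
} region_t;

/* Returns 1 if `byte` of a file holding `file_size` bytes is covered by
 * the region argument of a control operation, 0 otherwise. */
int region_covers(const region_t *r, uint64_t file_size, uint64_t byte)
{
    if (byte >= file_size)
        return 0;
    if (r == NULL)                        /* NULL op_data: whole file */
        return 1;
    if (r->size == 0 && r->stride == 0 && r->element_cnt == 0)
        return byte >= r->offset;         /* offset through end of file */
    for (uint64_t i = 0; i < r->element_cnt; i++) {
        /* Defensive check: element offsets must not wrap around. */
        assert(r->stride == 0 || i <= (UINT64_MAX - r->offset) / r->stride);
        uint64_t start = r->offset + i * r->stride;
        if (byte >= start && byte - start < r->size)
            return 1;
    }
    return 0;
}
```

For a 20-byte file, an entry with offset 10 and the other three fields zero covers bytes 10 through 19, while a NULL pointer covers bytes 0 through 19.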

Result Values

SIO_SUCCESS

The regions have been refreshed.

SIO_ERR_INVALID_FILE_LIST

op_data is neither NULL nor a pointer to a valid sio_file_io_list_t.


Footnote: The file system may satisfy this requirement by explicitly
validating all cached data in the specified region(s) with the server, or by
ejecting the specified blocks from the cache entirely.




13.7 SIO_CTL_Sync

Purpose

Force dirty data to stable storage.

Affects

Blocks written via the file descriptor.

Parameter Type

None


Description 

This operation causes all dirty blocks associated with the file descriptor
to be written to stable storage. The meaning of “stable storage” is
implementation specific: it may be the disk, non-volatile memory, or
another mechanism that provides greater reliability than the volatile
memory in the node caching the blocks. SIO_CTL_Sync performs a
superset of the operations performed by SIO_CTL_Propagate.


Result Values

SIO_SUCCESS

The operation succeeded.

SIO_ERR_IO_FAILED

A physical I/O error caused the operation to fail.

13.8 SIO_CTL_GetLayout, SIO_CTL_SetLayout

Purpose

Get or set the layout of the file data on the storage system.

Affects

Underlying file.

Parameter Type

Pointer to a sio_layout_t.

Description

These operations allow the layout of a file’s data on the underlying
storage system to be queried and modified.

The control SIO_CTL_GetLayout will return the layout for the
underlying file, while SIO_CTL_SetLayout will set the layout, if
possible. Implementations may choose to ignore SIO_CTL_SetLayout
entirely, returning SIO_ERR_OP_UNSUPPORTED.

Result Values


SIO_SUCCESS 

The  operation  succeeded. 

SIO_ERR_OP_ONLY_AT_CREATE 

The  implementation  only  supports  SIO_CTL_SetLayout  when 
a  file  is  being  created. 

SIO_ERR_INCORRECT_MODE

The  mode  of  the  file  descriptor  does  not  permit  the  operation. 

SIO_ERR_OP_UNSUPPORTED 

The  operation  is  not  supported  by  the  system. 

13.9 SIO_CTL_GetLabel, SIO_CTL_SetLabel

Purpose 

Get  or  set  the  file’s  label. 

Affects 

Underlying  file. 

Parameter  Type 

Pointer  to  a  sio_label_t. 

Description 

These operations allow the label associated with a file to be set and
retrieved. A file’s label is not interpreted by the file system. The intent
is for applications to store descriptive information about a file in the
file’s label, rather than in the file itself. That removes the need for file
headers and the inefficiencies that go with them.

The  maximum  size  of  a  file’s  label  is  SIO_MAX_LABEL_LEN,  the 
value  of  which  is  implementation-specific.  It  is  guaranteed,  however, 
to  be  at  least  as  big  as  SIO_MAX_NAME_LEN,  allowing  any  legal 
file  name  to  fit  in  a  label.  This  allows  descriptive  information  that  is 
too  large  to  fit  in  a  label  to  be  stored  in  an  auxiliary  file  whose  name 
can  be  stored  in  the  label  of  the  file  being  described. 

For descriptive labels to be portable across implementations
they must be no larger than the minimum allowed value for
SIO_MAX_LABEL_LEN.

When performing SIO_CTL_SetLabel, the data field of the
sio_label_t must contain a pointer to a buffer, and the length field must
contain the length of that buffer. If the length given is greater than
SIO_MAX_LABEL_LEN, SIO_ERR_INVALID_LABEL will be
returned and the operation will fail. After a SIO_CTL_SetLabel
operation successfully completes, the length of the file’s label will be equal
to length, and the file’s label data will be the same as the contents of
the buffer.

When performing SIO_CTL_GetLabel, the data field of the
sio_label_t must contain a pointer to a buffer to be filled in with
the file’s current label data, and the length field must contain the size
of that buffer. If the buffer is too small to contain the label, the
SIO_ERR_INVALID_LABEL error code will be returned, length
will be set to the actual length of the label, and the contents of the
data buffer will be unspecified. If the buffer is at least as large as
the current file label, SIO_SUCCESS will be returned, length will
be set to the actual length of the label (as set by a previous call to
SIO_CTL_SetLabel, or to zero if the file’s label has never been set),
and the data buffer will be filled with that many bytes of label data.
If the buffer is larger than the label, the contents of the bytes in the
buffer following the label are unspecified.
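
Because a too-small buffer still reports the label's actual length, a caller can probe first and then retry with an exact-size buffer. The sketch below illustrates that pattern against a mock, not against the real API: label_t, the LABEL_* codes, and mock_get_label() are hypothetical stand-ins invented here so that the fragment is self-contained, and the mock reads from an in-memory string rather than from a file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: the real sio_label_t and result codes come
 * from <sio_fs.h> and may differ from these. */
typedef struct { void *data; size_t length; } label_t;
enum { LABEL_OK = 0, LABEL_ERR_INVALID = 1 };

/* Mock of the SIO_CTL_GetLabel rules described above, backed by an
 * in-memory label rather than a real file. */
static int mock_get_label(const char *stored, size_t stored_len, label_t *req)
{
    size_t capacity = req->length;
    req->length = stored_len;          /* actual length is always reported */
    if (capacity < stored_len)
        return LABEL_ERR_INVALID;      /* buffer contents are unspecified */
    memcpy(req->data, stored, stored_len);
    return LABEL_OK;
}

/* Probe with a zero-length buffer, then retry with an exact-size one.
 * Returns a malloc'd copy that the caller frees, or NULL on failure. */
char *fetch_label(const char *stored, size_t stored_len, size_t *out_len)
{
    char probe = 0;
    label_t req = { &probe, 0 };

    assert(stored != NULL && out_len != NULL);
    if (mock_get_label(stored, stored_len, &req) == LABEL_OK) {
        *out_len = 0;                  /* an empty label fits in zero bytes */
        return calloc(1, 1);
    }
    char *buf = malloc(req.length);
    if (buf == NULL)
        return NULL;
    req.data = buf;                    /* req.length already holds the size */
    if (mock_get_label(stored, stored_len, &req) != LABEL_OK) {
        free(buf);
        return NULL;
    }
    *out_len = req.length;
    return buf;
}
```

Against a real file system the label could change between the two calls, so a robust caller would presumably loop until the second call succeeds.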


Result Values


SIO_SUCCESS 

The  operation  succeeded. 

SIO_ERR_INCORRECT_MODE 

The  mode  of  the  file  descriptor  does  not  permit  the  operation. 

SIO_ERR_INVALID_LABEL 

The length of the new label being set exceeds
SIO_MAX_LABEL_LEN, or the length of the label being
retrieved exceeds the size of the application-provided buffer.

SIO_ERR_IO_FAILED

A  physical  I/O  error  caused  the  operation  to  fail. 

SIO_ERR_NO_SPACE 

The  system  needs  to  increase  the  amount  of  storage  used  by  the 
file  but  cannot. 


13.10 SIO_CTL_GetConsistencyUnit

Purpose

Get the size of the cache consistency unit.

Affects

File system.

Parameter Type

Pointer to a sio_size_t.

Description

This operation returns the size of the cache consistency unit. The
consistency unit defines the granularity of cache consistency under weak
caching, as described in Section 12.

Result Values

SIO_SUCCESS

The operation succeeded.

14 Extension Support

Support for querying the presence of extensions is part of the basic API,
and must be implemented by all conforming implementations, even if no
extensions are supported by an implementation. Applications may
determine either statically (described in Section 14.1.1) or dynamically (via
the sio_query_extension() function, described in Section 14.2) whether or
not an extension is supported by the implementation of the API. Sample code
indicating the proper way to check for the presence of extensions is included
in Section 14.3.

14.1 Static Constants

14.1.1 Extension Support Constants

Applications may statically determine via constants which extensions are
supported by a given implementation. For each extension that an
implementation is capable of supporting, the implementation should define a
constant which indicates that the extension is supported, that it is not, or
that the support status cannot be determined during compilation. These
constants are of the form SIO_EXT_NAME_SUPPORTED, where NAME is the
name of the extension. Each of these constants must be set to one of the
following values:

SIO_EXT_ABSENT (equal to 0) The extension is not supported.

SIO_EXT_PRESENT The extension is supported.

SIO_EXT_MAYBE The extension might be supported. A dynamic check
must be used to make a final determination.

The SIO_EXT_ABSENT constant must be zero so that the existence of
extensions of which the implementation is completely unaware can be checked.
The values of the other constants are unspecified.

If the static constant for an extension is equal to SIO_EXT_ABSENT,
then the application cannot depend on any of the functions or definitions
that are a part of the extension (including the extension ID) being present.
If the static constant is SIO_EXT_PRESENT or SIO_EXT_MAYBE,
then the functions and definitions that are a part of the extension will be
present. In the case of SIO_EXT_MAYBE, the functions and definitions
may be usable only if the extension is determined to be available at run-time.

The definition of SIO_EXT_ABSENT allows for implementations to
conform to this API without requiring updates for any new extensions which
may be added in the future. The SIO_EXT_MAYBE value allows for binary
compatibility across different versions of an implementation that support
different sets of extensions.

Footnote: The C preprocessor will expand an unknown definition as zero
when used in preprocessor directives, and this allows undefined extension
support macros to match SIO_EXT_ABSENT.
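
The preprocessor behavior that this relies on can be checked with a few lines of standard C. In the sketch below, SIO_EXT_BAR_SUPPORTED names a hypothetical extension "BAR" and is deliberately never defined; BAR_IS_ABSENT and bar_is_absent() are illustrative helpers that are not part of the API.

```c
#include <assert.h>

#define SIO_EXT_ABSENT 0   /* the text requires this value to be zero */

/* SIO_EXT_BAR_SUPPORTED is intentionally left undefined, as it would be
 * on an implementation that has never heard of extension "BAR". Inside
 * #if, the preprocessor replaces the unknown identifier with 0. */
#if SIO_EXT_BAR_SUPPORTED == SIO_EXT_ABSENT
#define BAR_IS_ABSENT 1
#else
#define BAR_IS_ABSENT 0
#endif

/* Compile-time confirmation that the undefined macro matched zero. */
static_assert(BAR_IS_ABSENT == 1, "undefined macro compares equal to 0");

int bar_is_absent(void)
{
    return BAR_IS_ABSENT;
}
```

Some compilers can warn about undefined identifiers in #if (for example with an option such as -Wundef), so builds that treat warnings as errors may need an explicit defined() test instead.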



14.1.2 Extension Identifiers

Extension identifiers are constants of the form SIO_EXT_NAME, where
NAME is the name of the extension. Extension identifiers with names of
the form SIO_EXT_VEND_NAME are reserved for use by vendors, and all
other extension names are reserved for future use by this API.

An implementation must define an extension identifier for each extension
which is supported or may be supported by that implementation as
determined by the value of the extension’s SIO_EXT_NAME_SUPPORTED
constant described in Section 14.1.1. Extension identifiers can be given to
sio_query_extension() to check whether or not the extensions in question
are actually available.

Footnote: It is not necessary to call sio_query_extension() for extensions
whose extension support constants indicate that they are present, but it is
safe to do so and sio_query_extension() must indicate that those extensions
are supported.

14.2 sio_query_extension

Purpose 

Determine  whether  or  not  an  extension  is  supported. 

Syntax 

#include <sio_fs.h>

sio_return_t sio_query_extension(sio_extension_id_t ExtID);

Parameters

ExtID Extension identifier of extension being queried.

Description

This function takes an extension identifier and returns
SIO_SUCCESS if the extension is supported by this implementation,
or SIO_ERR_INVALID_EXTENSION if the extension is not
supported, or if the identifier is not recognized as valid.

Return Codes

SIO_SUCCESS

The extension is supported by the implementation.

SIO_ERR_INVALID_EXTENSION

ExtID contains an invalid or unsupported extension ID.


14.3 Sample Code to Check for Extension Presence

A code fragment which queries the presence of an extension might look like:

    int fooext_is_present;
    sio_return_t rc;

    #if SIO_EXT_FOO_SUPPORTED == SIO_EXT_ABSENT
        fooext_is_present = 0;
    #elif SIO_EXT_FOO_SUPPORTED == SIO_EXT_PRESENT
        fooext_is_present = 1;
    #else /* SIO_EXT_FOO_SUPPORTED == SIO_EXT_MAYBE */
        rc = sio_query_extension(SIO_EXT_FOO);
        switch (rc) {
        case SIO_SUCCESS:
            fooext_is_present = 1;
            break;
        case SIO_ERR_INVALID_EXTENSION:
            fooext_is_present = 0;
            break;
        default:
            /* Unexpected result: treat the extension as absent. */
            fooext_is_present = 0;
            printf("can't determine if extension foo is present (%s)\n",
                   sio_error_string(rc));
        }
    #endif /* SIO_EXT_FOO_SUPPORTED == SIO_EXT_ABSENT */


15  Extension:  Collective  I/O 

Static Constant: SIO_EXT_COLLECTIVE_SUPPORTED
Extension  ID:  SIO_EXT_COLLECTIVE 

15.1  Motivation 

As demonstrated by Kotz et al., collective I/O allows for a distributed
batching process which can greatly enhance I/O performance in a parallel file
system. Semantically, by declaring an I/O or set of I/Os to be part of a
single, collective I/O, an application is indicating to the file system that the
relative ordering of the components of the collective I/O is irrelevant, since
no portion of the application awaiting a component of the collective I/O can
make any progress until the entirety of the collective I/O completes. File
systems can take advantage of this to drastically reorder I/O components to
reduce overall latency, at the potential cost of increasing the latency of
component I/Os (the constraint which prevents this optimization from
occurring in the standard case).

15.2  High  Level  Look 

To initiate a collective I/O, one task of the application requests that a new
collective I/O handle be created. This is what we refer to as “defining” the
collective I/O. At this time, the application indicates the number of
participants, whether the collective I/O is a read or write operation (we do not
allow collective mixed read/writes), the number of iterations of the collective
I/O, and optionally indicates what portions of the file will be operated on.
Specification of file regions at define time provides (ordered) file access hints
which, if properly given, allow the file system to implement performance
optimizations.

Each participant “joins” an iteration of the collective I/O by providing the
handle created by the define operation, the file descriptor, the portions of
the file they wish to read or write, the source/destination memory locations,
their participant identifier, and a sequence number indicating which iteration
of the collective I/O they are joining.

Note that the application will generally need to pass the handle from the
task that defined the collective I/O to any other tasks that participate in the
I/O. A single task may participate multiple times in a given collective I/O
iteration by joining that iteration multiple times using different participant
numbers. Prior to joining a collective I/O operation, a task must open the
file being accessed so a file descriptor for the file is available for use with the
join call.

15.3 New Data Types

15.3.1  sio_coll_handle_t 

This is a 64-bit integral type used as an abstract handle to represent a
collective I/O. We explicitly define the format and size of this datatype because
applications will need to use their own communications mechanisms to pass
these among tasks on different nodes, and therefore need to be aware of size
and network ordering issues.
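
One way an application might deal with the ordering issue is to fix a byte order for the handle on the wire. The sketch below is an assumption-laden illustration: coll_handle_t is a hypothetical stand-in for sio_coll_handle_t (assumed unsigned here, which the text does not state), and the helper names and the choice of most-significant-byte-first order are ours, not part of the API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sio_coll_handle_t, which the text defines
 * only as a 64-bit integral type; unsigned is an assumption. */
typedef uint64_t coll_handle_t;

/* Encode the handle as 8 bytes, most significant byte first, so that
 * sender and receiver agree regardless of their native byte order. */
void handle_to_wire(coll_handle_t h, unsigned char out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(h >> (56 - 8 * i));
}

/* Decode 8 bytes produced by handle_to_wire() back into a handle. */
coll_handle_t handle_from_wire(const unsigned char in[8])
{
    coll_handle_t h = 0;

    assert(in != NULL);
    for (int i = 0; i < 8; i++)
        h = (h << 8) | in[i];
    return h;
}
```

The defining task would send the 8-byte buffer through whatever messaging layer the application already uses, and each joining task would decode it before calling the join function.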


15.3.2  sio_coll_participant_t 

This is an unsigned integral type with the range
[0 ... SIO_MAX_COLL_PARTICIPANTS] which is used in the definition
of a collective I/O operation to specify the number of participants, and in
the collective I/O join to identify the participant joining the collective I/O
iteration.

These  values  have  no  meaning  or  permanence  beyond  the  collective  I/O  in 
which  they  are  used. 


15.3.3 sio_coll_iteration_t

This is an unsigned integral type with the range
[0 ... SIO_MAX_COLL_ITERATIONS] which is used in the definition of
a collective I/O operation to specify the number of iterations, and in the
collective I/O join to identify the iteration being joined.

15.4 New Range Constants

15.4.1  SIO_MAX_COLL_ITERATIONS 

This  constant  specifies  the  maximum  number  of  iterations  that  a  collective 
I/O  can  describe.  The  minimum  value  is  1,  and  the  recommended  value  is 
128. 

15.4.2  SIO_MAX_COLL_PARTICIPANTS 

This  constant  specifies  the  maximum  number  of  participants  that  can  take 
part  in  a  collective  I/O.  The  minimum  value  is  16,  but  the  recommended 
value  is  at  least  256. 


15.4.3  SIO_MAX_COLL_OUTSTANDING 

This  constant  specifies  the  maximum  number  of  outstanding  collective  I/O 
requests  that  one  task  can  have  at  any  given  time.  The  minimum  value  is  1, 
and  the  recommended  value  is  at  least  512. 

15.5 New Functions

Two new functions are added by the collective I/O extension:
sio_coll_define() and sio_coll_join(), which are described in Sections 15.5.1
and 15.5.2, respectively.

15.5.1 sio_coll_define


Purpose 

Define  a  new  collective  I/O  and  get  a  handle  to  refer  to  it. 

Syntax 

#include  <sio_fs.h> 

sio_return_t sio_coll_define(int FileDescriptor,
        sio_coll_iteration_t NumIterations,
        const sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        sio_size_t IterationStride,
        sio_mode_t ReadWrite,
        sio_coll_participant_t NumParticipants,
        sio_coll_handle_t *Handle);


Parameters 

FileDescriptor  The  file  descriptor  of  an  open  file. 

NumIterations  The  number  of  times  the  collective  I/O  will  be  repeated. 

FileList  Specification  of  file  data  to  be  read  or  written. 

FileListLength  Number  of  elements  in  FileList.  This  may  be  zero. 

IterationStride  A  value  that  modifies  the  location  of  the  file  data  to  be 
read  or  written  as  specified  in  FileList  based  on  the  iteration  in 
progress. 

ReadWrite  One  of  SIO_MODE_READ  or  SIO_MODE_WRITE. 

NumParticipants  The  number  of  participants  in  each  iteration  of  the 
collective  I/O. 

Handle  On  success,  returns  the  handle  of  the  newly-defined  collective 
I/O. 

Description 

This interface creates a new handle for a collective I/O, and returns it in
Handle. The NumIterations parameter indicates the number of times
the collective I/O will be performed. The application programmer may
choose to disclose the portions of the file which will be affected in
FileList, or FileListLength may be zero, in which case the file system
must wait for a participant to call sio_coll_join() before its workload
is known.

In  cases  where  the  collective  I/O  will  be  performed  more  than  once  and 
the  application  programmer  indicates  what  portions  of  the  file  will  be 
operated  on,  it  is  often  true  that  the  access  patterns  for  each  iteration 
are  identical  except  for  their  offsets  from  the  beginning  of  the  file, 
and  that  the  offsets  are  based  on  the  iteration  being  performed.  The 
IterationStride  parameter  lets  the  programmer  express  these  common 
cases  without  having  to  separate  them  into  individual  collective  I/O 
operations. If i is the iteration number (starting at iteration 0), the
offset field in each sio_file_io_list_t structure of the FileList parameter
would have the value:

offset_i = offset_0 + (i x IterationStride)

where offset_0 is the offset value supplied in FileList.

For example, if FileList has two entries with the values {offset=0,
size=2, stride=3, element_cnt=4} and {offset=100, size=5, stride=0,
element_cnt=1}, the programmer is hinting that the first iteration will
access bytes (0, 1, 3, 4, 6, 7, 9, 10, 100, 101, 102, 103, 104) in the
file. If IterationStride is zero, the second iteration will access the same
bytes. However, if IterationStride is 50, the second iteration will access
bytes (50, 51, 53, 54, 56, 57, 59, 60, 150, 151, 152, 153, 154): the
offset components of the FileList structures are adjusted based on the
iteration (1) and the IterationStride (50).
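
The offset adjustment can be made concrete by expanding a file list into the byte offsets it touches in a given iteration. This is a sketch under stated assumptions: region_t is a hypothetical stand-in for sio_file_io_list_t with assumed 64-bit fields, stride is taken to be the distance between the starts of successive elements, and expand_iteration() is an illustrative helper rather than an API function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sio_file_io_list_t (field types assumed). */
typedef struct {
    uint64_t offset;      /* first byte of the first element */
    uint64_t size;        /* bytes per element */
    uint64_t stride;      /* assumed: distance between element starts */
    uint64_t element_cnt; /* number of elements */
} region_t;

/* Writes the byte offsets touched by `list` in iteration `iter` into
 * `out` (capacity `cap`) and returns how many offsets the list covers,
 * which may exceed `cap`. */
size_t expand_iteration(const region_t *list, size_t n, uint64_t iter,
                        uint64_t iter_stride, uint64_t *out, size_t cap)
{
    size_t k = 0;

    assert(list != NULL || n == 0);
    for (size_t e = 0; e < n; e++) {
        /* Apply the per-iteration shift to this entry's offset. */
        uint64_t base = list[e].offset + iter * iter_stride;
        for (uint64_t i = 0; i < list[e].element_cnt; i++) {
            for (uint64_t b = 0; b < list[e].size; b++) {
                if (k < cap)
                    out[k] = base + i * list[e].stride + b;
                k++;
            }
        }
    }
    return k;
}
```

With entries {0, 2, 3, 4} and {100, 5, 0, 1}, iteration 1, and an IterationStride of 50, the first entry expands to 50, 51, 53, 54, 56, 57, 59, 60 and the second to the five contiguous bytes starting at 150, for 13 offsets in total.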

Note that sio_coll_join() must always be called by each participant
and must provide a FileList for that participant’s portion of the
collective I/O, whether or not FileListLength is zero in sio_coll_define().
Providing a description of the entire operation in FileList simply
provides a way for the file system to optimize scheduling of the transfer.

Return  Codes 

SIO_SUCCESS 

The  function  succeeded. 

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the I/O.

SIO_ERR_INVALID_DESCRIPTOR

FileDescriptor does not refer to a valid file descriptor created by
sio_open().

SIO_ERR_INVALID_FILE_LIST

The file regions described by FileList are invalid, e.g. they contain
illegal offsets.

SIO_ERR_MAX_COLL_ITERATIONS_EXCEEDED

The number of iterations described by NumIterations exceeds the
maximum allowed as defined by SIO_MAX_COLL_ITERATIONS.

SIO_ERR_MAX_COLL_PARTICIPANTS_EXCEEDED

The number of participants described by NumParticipants exceeds the
maximum allowed as defined by SIO_MAX_COLL_PARTICIPANTS.

15.5.2 sio_coll_join


Purpose 

Initiate  an  asynchronous  transfer  as  part  of  a  collective  I/O. 

Syntax 

#include <sio_fs.h>

sio_return_t sio_coll_join(int FileDescriptor,
        sio_coll_handle_t Handle,
        sio_coll_participant_t Participant,
        sio_coll_iteration_t Iteration,
        const sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        const sio_mem_io_list_t *MemoryList,
        sio_count_t MemoryListLength,
        sio_async_handle_t *AsyncHandle);


Parameters 

FileDescriptor  The  file  descriptor  of  the  open  file  where  the  collective 
I/O  is  being  performed. 

Handle  The  handle  provided  by  sio_coll_define()  for  this  collective 
operation. 

Participant The identifier for this participant. This is a number in the
range [0 ... (NumParticipants - 1)], where NumParticipants is the
number of participants that was provided to sio_coll_define().

Iteration Which iteration of the collective I/O the participant is joining.

FileList Specification of file data to be read or written by this
participant.

FileListLength  Number  of  elements  in  FileList. 

MemoryList  Memory  locations  read  or  written  by  this  I/O  component. 

MemoryListLength  Number  of  elements  in  MemoryList. 

AsyncHandle  Handle  returned  by  the  operation,  which  can  be  used 
later  to  determine  the  status  of  the  I/O,  to  wait  for  its  completion, 
or  to  cancel  it. 

Description

This interface initiates a component of a collective I/O. At this point,
the file system may immediately begin transferring data to or from
these memory locations, or it may choose to wait for other participants
to join the collective I/O. The number of participants in each
iteration of the collective I/O must equal the NumParticipants specified to
sio_coll_define(), i.e. sio_coll_join() must be called NumParticipants
times for each iteration. sio_coll_join() returns immediately and
sio_async_status_any() or sio_async_cancel_all() must be called
with the AsyncHandle to complete or cancel the operation.

Note that calls to sio_async_status_any() or sio_async_cancel_all()
reflect only this participant’s portion of this iteration of the
collective I/O, as identified by the value of AsyncHandle. Also, calls to
sio_async_status_any() and sio_async_cancel_all() may
contain multiple AsyncHandles, but the AsyncHandles returned by
sio_coll_join() may not be mixed with AsyncHandles returned by the
sio_async_sg_read() or sio_async_sg_write() functions in the same
call.

To clarify some of the parameters a bit further, the FileDescriptor
parameter must refer to the same file as was specified by the
FileDescriptor in the sio_coll_define() for this collective operation.
However, the actual FileDescriptor value may differ from the one in the
sio_coll_define() because the task making the join call may be
different from the task that defined the collective operation.

If the sio_coll_define() for this collective operation contained
information about the bytes that would be accessed in its FileList
parameter, then to realize performance gains the FileList parameter in this
sio_coll_join() call should contain bytes that appeared in the original
sio_coll_define() FileList parameter. If this is not the case, or if the
sio_coll_define() did not contain file region information, the bytes
specified in the sio_coll_join() FileList parameter will still be read or
written, but potentially with poorer performance.

Finally, note that there is no parameter in the sio_coll_join() call
corresponding to the sio_coll_define() parameter IterationStride. In
the join, it is the responsibility of the application programmer to adjust
the FileList offset values as appropriate for the iteration being joined.

Return Codes

SIO_SUCCESS 

The  function  succeeded. 

SIO_ERR_INCORRECT_MODE 

The  mode  of  the  file  descriptor  does  not  permit  the  I/O. 

SIO_ERR_INVALID_DESCRIPTOR 

FileDescriptor does not refer to a valid file descriptor created
by sio_open(), or does not refer to the file specified to
sio_coll_define() when the collective I/O was created.

SIO_ERR_INVALID_FILE_LIST

The file regions described by FileList are invalid, e.g. they contain
illegal offsets. Implementations may defer returning this error
until sio_async_status_any() is invoked on the I/O.

SIO_ERR_INVALID_HANDLE

Handle is not the handle for a collective I/O.

SIO_ERR_INVALID_ITERATION

Iteration is not valid, either because it is greater than the number
of iterations specified when the collective I/O was created or
because the task already joined that iteration of the I/O.

SIO_ERR_INVALID_MEMORY_LIST

The memory regions described by MemoryList are invalid, e.g.
they contain illegal addresses. Implementations may defer returning
this error until sio_async_status_any() is invoked on the
I/O.

SIO_ERR_INVALID_PARTICIPANT 

Participant  is  not  valid  because  it  is  greater  than  the  number  of 
participants  specified  when  the  collective  I/O  was  created. 

SIO_ERR_MAX_ASYNC_OUTSTANDING_EXCEEDED 

The  I/O  request  could  not  be  initiated  because  doing  so  would 
cause  the  calling  task’s  number  of  outstanding  asynchronous  I/Os 
to  exceed  the  limit. 

SIO_ERR_UNEQUAL_LISTS

The number of bytes in MemoryList and FileList are not
equal. Implementations may defer returning this error until
sio_async_status_any() is invoked on the I/O.
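
An application can guard against this error before joining by totalling the bytes that each list describes. The sketch below assumes that each list entry contributes size times element_cnt bytes; file_region_t and mem_region_t are hypothetical stand-ins for sio_file_io_list_t and sio_mem_io_list_t, and the field names of the memory-list entry in particular are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for sio_file_io_list_t and sio_mem_io_list_t;
 * the memory-list field names are assumed by analogy with the file list. */
typedef struct { uint64_t offset, size, stride, element_cnt; } file_region_t;
typedef struct { void *addr; uint64_t size, stride, element_cnt; } mem_region_t;

/* Total bytes described by a file list. */
static uint64_t file_bytes(const file_region_t *l, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += l[i].size * l[i].element_cnt;
    return total;
}

/* Total bytes described by a memory list. */
static uint64_t mem_bytes(const mem_region_t *l, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += l[i].size * l[i].element_cnt;
    return total;
}

/* Returns 1 when both lists describe the same number of bytes. */
int lists_are_equal_length(const file_region_t *fl, size_t fn,
                           const mem_region_t *ml, size_t mn)
{
    assert((fl != NULL || fn == 0) && (ml != NULL || mn == 0));
    return file_bytes(fl, fn) == mem_bytes(ml, mn);
}
```

A file list of {0, 2, 3, 4} and {100, 5, 0, 1} describes 13 bytes, so it pairs with a single 13-byte memory buffer but not with a 12-byte one.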

16 Extension: Fast Copy

Static Constant: SIO_EXT_FAST_COPY_SUPPORTED

Extension ID: SIO_EXT_FAST_COPY

This extension provides a low-level versioning mechanism by allowing an
efficient “snapshot” of a file’s current contents to be created. This is done
via the sio_control() operation SIO_CTL_FastCopy.

The SIO_CTL_FastCopy control operation creates snapshots by replacing
the contents of a parallel file (created and opened with sio_open()) with
the contents of the file being duplicated. Since snapshots are normal parallel
files, they can be accessed in all of the ways that parallel files can be accessed.
That is, snapshots created by SIO_CTL_FastCopy can be read, written,
operated on by controls, etc.

If a higher-level file system library is using SIO_CTL_FastCopy to provide
versioning support, that library is responsible for managing the
translation between its notion of versions and that provided by the
SIO_CTL_FastCopy mechanism. For instance, the higher-level library
must translate between the file name and version number that the application
supplies and the actual parallel name for that snapshot. The higher-level
library must also enforce its own version reference semantics (perhaps
preventing write access to old versions of the file, or taking other actions as
necessary).
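
As an illustration of the name translation described above, the following C sketch shows how a higher-level library might derive the parallel file name of a snapshot from the file name and version number supplied by the application. The "name.vN" naming scheme and the function sample_snapshot_name() are hypothetical; neither is defined by this API.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper for a versioning library: build the parallel file
 * name under which version `version` of `base_name` would be stored.
 * Returns 0 on success, or -1 if the result does not fit in `out`.
 */
int sample_snapshot_name(const char *base_name, unsigned version,
                         char *out, size_t out_size)
{
    /* snprintf reports the length it wanted to write, so truncation
     * can be detected by comparing against the buffer size. */
    int n = snprintf(out, out_size, "%s.v%u", base_name, version);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A library using such a scheme would open or create the resulting name with sio_open() and then apply SIO_CTL_FastCopy to it; enforcing read-only access to old versions remains the library's responsibility.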

16.1 SIO_CTL_FastCopy

Purpose 

Efficiently  copy  the  contents  of  one  file  into  another. 

Affects 

Underlying  file. 

Parameter  Type 

Pointer  to  an  int  which  is  a  file  descriptor  for  the  open  parallel  file  to 
be  used  as  the  source  of  the  efficient  copy  operation. 

Description 

This operation performs an efficient copy of the contents of one parallel
file into another. The source file descriptor is specified by the
int pointed to by the op_data member of the sio_control_t. The destination
file is specified by the Name argument to sio_open() or by the
FileDescriptor argument to sio_control().

The implementation of the efficient copy operation performed by this
function is intended to use copy-on-write or similar techniques to minimize
data duplication.

If the SIO_CTL_FastCopy operation fails or is not supported, an
error will be returned and the source and destination files will be
unmodified.

Effects  of  Successful  Operation  on  the  Source  File 

The  source  file’s  data  are  unmodified  by  the  SIO_CTL_FastCopy 
operation. 

The  source  file’s  physical  size  at  the  conclusion  of  the 
SIO_CTL_FastCopy  operation  is  unspecified. 

None  of  the  source  file’s  other  file  or  file  descriptor  attributes  (as  defined 
by  this  API)  are  modified  by  the  SIO_CTL_FastCopy  operation. 

If vendors define new attributes, the effect of SIO_CTL_FastCopy on
the  source  file  with  respect  to  those  attributes  should  be  specified. 

Hints about expected use of the source file are unmodified by the
SIO_CTL_FastCopy operation.

Effects  of  Successful  Operation  on  the  Destination  File 

The destination file’s logical size is set to the source file’s logical
size, and the destination file’s contents are made to appear identical
(e.g. if accessed with sio_sg_read()) to those of the source file. If
SIO_CTL_SetSize is specified in the same set of control operations as
SIO_CTL_FastCopy, the resulting size of the destination file is
undefined.

The  destination  file’s  physical  size  at  the  conclusion  of  the 
SIO_CTL_FastCopy  operation  is  unspecified. 

The  destination  file’s  label  is  made  identical  to  the  source  file’s  label. 

The  destination  file’s  other  file  attributes  (preallocation  and  layout)  are 
not  affected. 

None  of  the  destination’s  file  descriptor  attributes  (caching  mode  and 
consistency  unit)  are  affected.  Note  that  if  a  weak  client  caching  mode 
is  in  use  on  the  destination  file,  the  destination  file’s  new  contents  may 
need  to  be  propagated  (with  SIO_CTL_Propagate)  before  they  can 
be  used  by  other  clients. 

If  vendors  define  new  attributes,  the  effect  of  SIO_CTL_FastCopy  on 
the  destination  file  with  respect  to  those  attributes  should  be  specified. 

The effect of the SIO_CTL_FastCopy operation on hints about expected
use of the destination file is unspecified. Portable applications
or libraries that wish to hint about future accesses to the destination
file should cancel all outstanding hints on the destination file after
performing a SIO_CTL_FastCopy operation and then reissue hints as
appropriate.

Result  Values 

SIO_SUCCESS 

The  function  succeeded. 

SIO_ERR_INVALID_DESCRIPTOR 

The  file  descriptor  for  the  source  file  is  invalid. 

SIO_ERR_NO_SPACE

There  isn’t  enough  free  space  to  perform  a  fast  copy. 

SIO_ERR_OP_UNSUPPORTED 

Fast  copy  is  not  supported  by  the  implementation  for  files  with 
the  attributes  of  the  source  file  and/or  destination  file. 
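
The following C sketch shows one way a caller might issue a single SIO_CTL_FastCopy control and observe its result. It is a minimal illustration under stated assumptions, not a definitive usage: the three-argument form of sio_control(), the op_code and result members of sio_control_t, and all numeric values are assumed here (only op_data is named in the text above). The in-memory mock file system exists only so that the sketch is self-contained; a real implementation would use copy-on-write or a similar technique rather than copying bytes.

```c
#include <string.h>

/* Assumed declarations standing in for the real SIO header. */
typedef int sio_return_t;
enum { SIO_SUCCESS = 0, SIO_ERR_INVALID_DESCRIPTOR, SIO_ERR_OP_UNSUPPORTED };
enum { SIO_CTL_FastCopy = 1 };            /* numeric value assumed */

typedef struct {
    int          op_code;  /* assumed member: which control to perform */
    void        *op_data;  /* for FastCopy: points to the source descriptor */
    sio_return_t result;   /* assumed member: per-operation result */
} sio_control_t;

/* Mock file system: a descriptor is an index into a small table. */
#define MOCK_FILES 4
#define MOCK_SIZE  64
char   mock_data[MOCK_FILES][MOCK_SIZE];
size_t mock_len[MOCK_FILES];

static int mock_valid(int fd) { return fd >= 0 && fd < MOCK_FILES; }

/* Mock sio_control(): understands only SIO_CTL_FastCopy and stops at
 * the first failing operation, leaving both files unmodified. */
sio_return_t sio_control(int fd, sio_control_t *ops, int op_count)
{
    for (int i = 0; i < op_count; i++) {
        if (ops[i].op_code != SIO_CTL_FastCopy)
            return ops[i].result = SIO_ERR_OP_UNSUPPORTED;
        int src = *(const int *)ops[i].op_data;
        if (!mock_valid(fd) || !mock_valid(src))
            return ops[i].result = SIO_ERR_INVALID_DESCRIPTOR;
        memmove(mock_data[fd], mock_data[src], mock_len[src]);
        mock_len[fd] = mock_len[src];   /* logical size follows the source */
        ops[i].result = SIO_SUCCESS;
    }
    return SIO_SUCCESS;
}

/* Hypothetical wrapper: snapshot src_fd into dst_fd with one control. */
sio_return_t sample_snapshot(int dst_fd, int src_fd)
{
    sio_control_t op = { SIO_CTL_FastCopy, &src_fd, SIO_SUCCESS };
    return sio_control(dst_fd, &op, 1);
}
```

Note that the destination is named by the descriptor passed to sio_control(), while the source travels through op_data, matching the description above.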

A Result codes (for sio_return_t)


This appendix describes some error and return codes that the parallel file
system may wish to return. As discussed in the Data Types section, implementors
should feel free to add whatever additional codes they see fit, and
should make sio_error_string() aware of them.

SIO_SUCCESS 

The  operation  completed  successfully.  The  value  of  SIO_SUCCESS 
must  always  be  0. 

SIO_ERR_ALREADY_EXISTS 

The  file  name  to  be  created  already  exists. 

SIO_ERR_CONTROL_FAILED

One  or  more  of  the  control  operations  requested  by  sio_control(), 
sio_open(),  or  sio_test()  was  unsuccessful. 

SIO_ERR_CONTROL_NOT_ATTEMPTED

A  control  operation  requested  by  sio_control(),  sio_open(),  or 
sio_test()  was  not  attempted. 

SIO_ERR_CONTROL_NOT_ON_TEST 

The  control  operation  cannot  be  used  with  sio_test(). 

SIO_ERR_CONTROL_WOULD_HAVE_SUCCEEDED 

The control operation would have succeeded but the function performing
the control failed.

SIO_ERR_CONTROLS_CLASH 

The list of controls contains combinations of operations that are
incompatible.

SIO_ERR_FILE_NOT_FOUND 

The  specified  file  did  not  exist. 

SIO_ERR_FILE_OPEN 

The  operation  failed  because  the  file  was  open. 

SIO_ERR_INCORRECT_MODE

The mode of the file descriptor does not permit the operation or
function.

SIO_ERR_INVALID_CLASS

The hint class is not valid.

SIO_ERR_INVALID_DESCRIPTOR

A file descriptor argument was not a valid parallel file descriptor.

SIO_ERR_INVALID_EXTENSION

An invalid extension identifier was given, or the indicated extension is
not supported.

SIO_ERR_INVALID_FILE_LIST

The file list argument is invalid (e.g. contains illegal offsets).

SIO_ERR_INVALID_FILENAME

A file name argument did not contain a legal file name (e.g. it was too
long).

SIO_ERR_INVALID_HANDLE

A handle argument does not contain a valid handle.

SIO_ERR_INVALID_ITERATION

The iteration argument is invalid.

SIO_ERR_INVALID_MEMORY_LIST

The memory list argument is invalid (e.g. contains an illegal address).

SIO_ERR_INVALID_PARTICIPANT

The participant number provided is not valid because it is greater than
the number of participants specified when the collective I/O was
created.

SIO_ERR_IO_CANCELED

An asynchronous I/O did not complete because it was canceled while
in progress.

SIO_ERR_IO_FAILED

A physical I/O error occurred.

SIO_ERR_IO_IN_PROGRESS

An  asynchronous  I/O  has  not  yet  completed. 

SIO_ERR_MAX_ASYNC_OUTSTANDING_EXCEEDED 

The  I/O  request  could  not  be  initiated  because  doing  so  would  cause 
the  calling  task’s  number  of  outstanding  asynchronous  I/Os  to  exceed 
the  limit. 

SIO_ERR_MAX_COLL_ITERATIONS_EXCEEDED 

The  number  of  iterations  specified  for  a  collective  I/O  exceeds  the  limit. 

SIO_ERR_MAX_COLL_OUTSTANDING_EXCEEDED

The I/O request could not be initiated because doing so would cause
the calling task’s number of outstanding collective I/Os to exceed the
limit.

SIO_ERR_MAX_COLL_PARTICIPANTS_EXCEEDED

The  number  of  participants  specified  for  a  collective  I/O  exceeds  the 
limit. 

SIO_ERR_MAX_OPEN_EXCEEDED

The  file  could  not  be  opened  because  doing  so  would  cause  the  calling 
task’s  number  of  open  files  to  exceed  the  limit. 

SIO_ERR_MIXED_COLL_AND_ASYNC 

The implementation does not allow asynchronous I/O handles created by
sio_coll_define() to be passed to functions in the same list as handles
from sio_async_sg_read() and sio_async_sg_write().

SIO_ERR_NO_SPACE 

An  operation  that  would  allocate  more  storage  to  a  file  failed  because 
no  storage  could  be  allocated. 

SIO_ERR_ONLY_AT_CREATE

The  control  operation  may  only  be  specified  during  a  call  to  sio_open() 
which  is  creating  a  file. 

SIO_ERR_ONLY_AT_OPEN 

The control operation may only be specified during a call to
sio_open().

SIO_ERR_OP_UNSUPPORTED

The parallel file system has elected to not support this interface. Note
that some interfaces may not be supported, but implementations can
choose to return SIO_SUCCESS for all cases instead.

SIO_ERR_UNEQUAL_LISTS

The numbers of bytes in the memory list and file list arguments to an I/O
operation are not the same.
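
As a sketch of how an implementation might make an error-string routine aware of these codes, the following C fragment maps a handful of them to messages through a table. The function sample_error_string() is a hypothetical stand-in and does not claim to match the signature of sio_error_string(). Apart from SIO_SUCCESS, which this appendix fixes at 0, the numeric values and the sentinel SAMPLE_ERR_COUNT are assumptions made for this example.

```c
typedef int sio_return_t;

/* Only SIO_SUCCESS == 0 is required by this appendix; the remaining
 * values are assumptions made for this sketch. */
enum {
    SIO_SUCCESS = 0,
    SIO_ERR_ALREADY_EXISTS,
    SIO_ERR_FILE_NOT_FOUND,
    SIO_ERR_NO_SPACE,
    SAMPLE_ERR_COUNT        /* hypothetical: number of table entries */
};

/* Message table indexed by result code. */
static const char *const sample_messages[SAMPLE_ERR_COUNT] = {
    [SIO_SUCCESS]            = "The operation completed successfully.",
    [SIO_ERR_ALREADY_EXISTS] = "The file name to be created already exists.",
    [SIO_ERR_FILE_NOT_FOUND] = "The specified file did not exist.",
    [SIO_ERR_NO_SPACE]       = "No storage could be allocated.",
};

/* Return a message for `code`, or a fallback for codes outside the table. */
const char *sample_error_string(sio_return_t code)
{
    if (code < 0 || code >= SAMPLE_ERR_COUNT)
        return "Unknown result code.";
    return sample_messages[code];
}
```

An implementor adding vendor-specific codes would extend both the enumeration and the table, so that the two stay in step.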

B Sample Derived Interfaces

This section describes some simple interfaces which could easily be created
using the interfaces provided by this API. These derived interfaces are not a
part of this API, and are intended only as examples of interfaces which could
be provided by high level libraries.

If a high level library provides interfaces similar (or identical) to the sample
interfaces presented here, those interfaces should be named in accordance
with the rest of the interfaces provided by that library. In other words, use
of the names given here is strongly discouraged.

B.1 Synchronous I/O

Routines 

sio_return_t sample_read(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_offset_t Offset, sio_size_t Count,
        sio_transfer_len_t *BytesRead);

sio_return_t sample_write(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_offset_t Offset, sio_size_t Count,
        sio_transfer_len_t *BytesWritten);

sio_return_t sample_read_io_list(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        sio_transfer_len_t *BytesRead);

sio_return_t sample_write_io_list(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        sio_transfer_len_t *BytesWritten);

sio_return_t sample_read_mem_list(int FileDescriptor,
        sio_mem_io_list_t *MemoryList,
        sio_count_t MemoryListLength,
        sio_offset_t Offset,
        sio_transfer_len_t *BytesRead);

sio_return_t sample_write_mem_list(int FileDescriptor,
        sio_mem_io_list_t *MemoryList,
        sio_count_t MemoryListLength,
        sio_offset_t Offset,
        sio_transfer_len_t *BytesWritten);


Parameters 

FileDescriptor  The file descriptor of an open parallel file.

BufferPointer  Memory address of contiguous buffer containing data to
be written or to contain data being read.

Offset  Starting file offset from which to read or at which to write.

Count  Number of bytes to read or write.

BytesRead  Number of bytes actually read.

BytesWritten  Number of bytes actually written.

FileList  Description of strided regions within the file.

FileListLength  Number of valid elements to use in FileList.

MemoryList  Description of strided regions within the memory buffer.

MemoryListLength  Number of valid elements to use in MemoryList.

Description

These functions would provide a simplified synchronous I/O interface.
They may be implemented as wrappers which would convert the given
arguments into sio_mem_io_list_t and sio_file_io_list_t structures (as
necessary) and invoke sio_sg_read() or sio_sg_write().

The functions sample_read() and sample_write() would transfer
data between a single contiguous memory buffer and a single contiguous
region of the file. The functions sample_read_io_list()
and sample_write_io_list() would use a single contiguous memory
buffer, but a strided region within the file. Similarly,
sample_read_mem_list() and sample_write_mem_list() would use
a contiguous file region, but a strided region within the memory buffer.
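
The wrapper approach described above can be sketched in C for sample_read(). This is an illustration under stated assumptions rather than a definitive implementation: the member names of sio_file_io_list_t and sio_mem_io_list_t, the argument order of sio_sg_read(), and the underlying types of the sio_* typedefs are all assumed here. The mock sio_sg_read(), which serves a single in-memory file and handles only one-entry lists, exists only to make the sketch self-contained.

```c
#include <string.h>

/* Assumed declarations standing in for the real SIO header. */
typedef int           sio_return_t;
typedef void         *sio_addr_t;
typedef long          sio_offset_t;
typedef unsigned long sio_size_t;
typedef unsigned long sio_count_t;
typedef unsigned long sio_transfer_len_t;
enum { SIO_SUCCESS = 0, SIO_ERR_UNEQUAL_LISTS = 1 };  /* nonzero value assumed */

/* Each entry is assumed to describe element_cnt blocks of `size` bytes,
 * with consecutive blocks starting `stride` bytes apart. */
typedef struct {
    sio_offset_t offset; sio_size_t size, stride; sio_count_t element_cnt;
} sio_file_io_list_t;
typedef struct {
    sio_addr_t addr; sio_size_t size, stride; sio_count_t element_cnt;
} sio_mem_io_list_t;

/* Mock: one in-memory "file"; only one-entry, one-block lists are handled. */
char mock_file[64] = "0123456789abcdefghijklmnopqrstuvwxyz";

sio_return_t sio_sg_read(int fd,
                         const sio_file_io_list_t *fl, sio_count_t fl_len,
                         const sio_mem_io_list_t *ml, sio_count_t ml_len,
                         sio_transfer_len_t *total)
{
    (void)fd; (void)fl_len; (void)ml_len;
    if (fl[0].size != ml[0].size)
        return SIO_ERR_UNEQUAL_LISTS;
    memcpy(ml[0].addr, mock_file + fl[0].offset, fl[0].size);
    *total = fl[0].size;
    return SIO_SUCCESS;
}

/* Derived interface: contiguous buffer, contiguous file region. Each
 * list holds one entry describing a single block of Count bytes. */
sio_return_t sample_read(int FileDescriptor, sio_addr_t BufferPointer,
                         sio_offset_t Offset, sio_size_t Count,
                         sio_transfer_len_t *BytesRead)
{
    sio_file_io_list_t file = { Offset, Count, Count, 1 };
    sio_mem_io_list_t  mem  = { BufferPointer, Count, Count, 1 };
    return sio_sg_read(FileDescriptor, &file, 1, &mem, 1, BytesRead);
}
```

The list variants would differ only in which of the two lists is passed through from the caller instead of being built from a single offset or buffer address.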

B.2 Asynchronous I/O

Routines 

sio_return_t sample_async_read(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_offset_t Offset,
        sio_size_t Count,
        sio_async_handle_t *Handle);

sio_return_t sample_async_write(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_offset_t Offset,
        sio_size_t Count,
        sio_async_handle_t *Handle);

sio_return_t sample_async_read_io_list(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        sio_async_handle_t *Handle);

sio_return_t sample_async_write_io_list(int FileDescriptor,
        sio_addr_t BufferPointer,
        sio_file_io_list_t *FileList,
        sio_count_t FileListLength,
        sio_async_handle_t *Handle);

sio_return_t sample_async_read_mem_list(int FileDescriptor,
        sio_mem_io_list_t *MemoryList,
        sio_count_t MemoryListLength,
        sio_offset_t Offset,
        sio_async_handle_t *Handle);

sio_return_t sample_async_write_mem_list(int FileDescriptor,
        sio_mem_io_list_t *MemoryList,
        sio_count_t MemoryListLength,
        sio_offset_t Offset,
        sio_async_handle_t *Handle);

Parameters

FileDescriptor  The file descriptor of an open parallel file.

BufferPointer  Memory  address  of  contiguous  buffer  containing  data  to 
be  written  or  to  contain  data  being  read. 

Offset  Starting  file  offset  from  which  to  read  or  at  which  to  write. 

Count  Number  of  bytes  to  read  or  write. 

BytesRead  Number  of  bytes  actually  read. 

BytesWritten  Number  of  bytes  actually  written. 

FileList  Description  of  strided  regions  within  the  file. 

FileListLength  Number  of  valid  elements  to  use  in  FileList. 

MemoryList  Description  of  strided  regions  within  the  memory  buffer. 

MemoryListLength  Number of valid elements to use in MemoryList.

Handle  Handle  for  asynchronous  I/O  that  can  later  be  used  to  test  its 
status. 

Description

These routines would provide a simplified asynchronous I/O interface.
They may be implemented as wrappers which would convert the given
arguments into sio_mem_io_list_t and sio_file_io_list_t structures (as
necessary) and invoke sio_async_sg_read() or sio_async_sg_write().

These  functions  would  take  arguments  similar  to  those  given  to  the 
simplified  synchronous  functions,  and  perform  similar  actions. 

B.3 Cache Consistency


Functions

sio_return_t sample_propagate(int FileDescriptor,
        sio_offset_t Offset,
        sio_size_t Length);

sio_return_t sample_refresh(int FileDescriptor,
        sio_offset_t Offset,
        sio_size_t Length);

Parameters

FileDescriptor  File descriptor to which cache consistency action applies.

Offset  Starting file offset affected by consistency action.

Length  Number of bytes affected by consistency action.

Description

These functions would perform cache consistency actions on the specified
region of the file associated with the given file descriptor. They may
be implemented as wrappers which would invoke sio_control() to perform
the appropriate SIO_CTL_Propagate or SIO_CTL_Refresh operation.
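
The two cache consistency wrappers can be sketched in C as thin layers over a single control call. This is only an illustration under stated assumptions: the three-argument form of sio_control(), the members of sio_control_t, the numeric control values, and in particular the sample_range_t layout used as op_data are all assumed here and are not taken from this API. The mock sio_control() merely records the request so that the sketch is self-contained.

```c
/* Assumed declarations standing in for the real SIO header. */
typedef int           sio_return_t;
typedef long          sio_offset_t;
typedef unsigned long sio_size_t;
enum { SIO_SUCCESS = 0 };
enum { SIO_CTL_Propagate = 1, SIO_CTL_Refresh = 2 };   /* values assumed */

typedef struct {
    int          op_code;  /* assumed member: which control to perform */
    void        *op_data;  /* operation-specific argument */
    sio_return_t result;   /* assumed member: per-operation result */
} sio_control_t;

/* Hypothetical op_data layout for both controls: one contiguous range. */
typedef struct { sio_offset_t offset; sio_size_t length; } sample_range_t;

/* Mock sio_control(): records the last request instead of touching a cache. */
int            mock_last_fd = -1;
int            mock_last_op = 0;
sample_range_t mock_last_range;

sio_return_t sio_control(int fd, sio_control_t *ops, int op_count)
{
    for (int i = 0; i < op_count; i++) {
        mock_last_fd    = fd;
        mock_last_op    = ops[i].op_code;
        mock_last_range = *(const sample_range_t *)ops[i].op_data;
        ops[i].result   = SIO_SUCCESS;
    }
    return SIO_SUCCESS;
}

/* Shared body of the two wrappers: issue one control on one range. */
static sio_return_t sample_consistency(int fd, int op_code,
                                       sio_offset_t offset, sio_size_t length)
{
    sample_range_t range = { offset, length };
    sio_control_t  op    = { op_code, &range, SIO_SUCCESS };
    return sio_control(fd, &op, 1);
}

sio_return_t sample_propagate(int FileDescriptor,
                              sio_offset_t Offset, sio_size_t Length)
{
    return sample_consistency(FileDescriptor, SIO_CTL_Propagate, Offset, Length);
}

sio_return_t sample_refresh(int FileDescriptor,
                            sio_offset_t Offset, sio_size_t Length)
{
    return sample_consistency(FileDescriptor, SIO_CTL_Refresh, Offset, Length);
}
```

Sharing one helper keeps the two wrappers identical except for the control they name, which mirrors the symmetry of the two prototypes.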


